Fault detection and prediction for management of computer networks

ABSTRACT

An improved system and method for network fault and anomaly detection is provided based on the statistical behavior of the management information base (MIB) variables. The statistical and temporal information at the variable level is obtained from the sensors associated with the MIB variables. Each sensor performs sequential hypothesis testing based on the Generalized Likelihood Ratio (GLR) test. The outputs of the individual sensors are combined using a fusion center, which incorporates the interdependencies of the MIB variables. The fusion center provides temporally correlated alarms that are indicative of network problems. The detection scheme relies on traffic measurement and is independent of specific fault descriptions.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates generally to the field of network management. More specifically, this invention relates to a system for network fault detection and prediction utilizing statistical behavior of Management Information Base (MIB) variables.

[0003] 2. Description of Prior Art

[0004] Prediction of network faults, anomalies and performance degradation forms an important component of network management. This feature is essential to provide a reliable network along with real-time quality of service (QoS) guarantees. The advent of real-time services on the network creates a need for continuous monitoring and prediction of network performance and reliability. Although faults are rare events, when they do occur, they can have enormous consequences. Yet the rareness of network faults makes their study difficult. Performance problems occur more often and in some cases may be considered as indicators of an impending fault. Efficient handling of these performance issues may help eliminate the occurrence of severe faults.

[0005] Most of the work done in the area of network fault detection can be classified under the general area of alarm correlation. Several approaches have been used to model alarm sequences that occur during and before fault events. The goal behind alarm correlation is to obtain fault identification and diagnosis. The sequence of alarms obtained from the different points in the network is modeled as the states of a finite state machine. The transitions between the states are measured using prior events. The difficulty encountered in using this method is that not all faults can be captured by a finite sequence of alarms of reasonable length. This causes the number of states required to explode as a function of the number and complexity of faults modeled. Furthermore, the number of parameters to be learned increases, and these parameters may not remain constant as the network evolves. Accounting for this variability would require extensive off-line learning before the scheme can be deployed on the network. More importantly, there is an underlying assumption that the alarms obtained are true. No attempt is made to generate the individual alarms themselves.

[0006] Another method of generating alarms is the trouble ticketing system used by several of the commercial network management packages. A trouble ticket is a qualitative description of the symptoms of a fault or performance problem as perceived by a user or a network manager. In this method there is no guarantee of the accuracy of the temporal information. Also, the user may not be able to describe all aspects of the problem accurately enough to initiate appropriate recovery methods.

[0007] Syslog messages are also widely used as sources of alarms. However, these messages are difficult to comprehend and synthesize. There are also large volumes of syslog messages generated in any given network and they are often reactive to a network problem. This reactive nature precludes the use of these messages for predictive alarm generation.

[0008] Early work in the area of fault detection was based on expert systems. In expert systems an exhaustive database containing the rules of behavior of the faulty system is used to determine if a fault occurred. These rule-based systems rely heavily on the expertise of the network manager. The rules are dependent on prior knowledge about the fault conditions on the network and do not adapt well to the evolving network environment. Thus, it is possible that entirely new faults may escape detection. Furthermore, even for a stable network, there are no guarantees that an exhaustive database has been created.

[0009] In contrast, case-based reasoning is an extension of rule-based systems and it differs from detection based on expert systems in that, in addition to just rules, a picture of the previous fault scenarios is used to make the decisions. A picture in this sense refers to the circumstances or events that led to the fault. These descriptions of the fault cases also suffer from the heavy dependence on past information. In order to adapt the scheme to the changing network environment, adaptive learning techniques are used to obtain the functional dependence of relevant criteria such as network load, collision rate, etc., to previous trouble tickets available in the database. But using any functional approximation scheme, such as back propagation, causes an increase in computation time and complexity. The identification of relevant criteria for the different faults will in turn require a set of rules to be developed. The number of functions to be learned also increases with the number of faults studied.

[0010] Another method is the adaptive thresholding scheme which is the basis of most commercially available online network management tools. Thresholds are set to adapt to the changing behavior of network fault. These methods are primarily based on the second-order statistics (mean and variance) of the traffic. However, network traffic has been shown to have complex patterns and it is becoming increasingly clear that the second-order statistics alone may not be sufficient to capture the traffic behavior over long periods of time. These methods can, at best, detect only severe failures or performance issues such as a broken link or a significant loss of link capacity. Hence, using adaptive thresholding based on second-order statistics, the changes in traffic behavior that are indicative of impending network problems (e.g., file server crashes) cannot be detected, precluding the possibility of prediction. In adaptive thresholding, the challenge is to identify the optimal settings of the threshold in the presence of evolving network traffic whose characteristics are intrinsically heterogeneous and stochastic.

[0011] Further, there are some inherent difficulties encountered when working in the area of network fault detection. The evolving nature of IP networks, both in terms of the size and also the variety of network components and services, makes it difficult to fully understand the dynamics of the traffic on the network. Network traffic itself has been shown to be composed of complex patterns. Vast amounts of information need to be collected, processed, and synthesized to provide a meaningful understanding of the different network functions. These problems make it hard for a human system administrator to manage and understand all of the tasks that go into the smooth operation of the network. The skills learned from any one network may prove insufficient in managing a different network thus making it difficult to generalize the knowledge gained from any given network.

[0012] As described above, one of the common shortcomings of the existing fault detection schemes is that the identification of faults depends upon symptoms that are specific to a particular manifestation of a fault. Examples of these symptoms are excessive utilization of bandwidth, number of open TCP connections, total throughput exceeded, etc. Further, there are no accurate statistical models for normal network traffic and this makes it difficult to characterize the statistical behavior of abnormal traffic patterns. Also, there is no single variable or metric that captures all aspects of network function. This also presents the problem of synthesizing information from metrics with widely differing statistical properties. Also, one of the major constraints on the development of network fault detection algorithms is the need to maintain a low computational complexity to facilitate online implementation. Hence, what is needed is a system which is independent of such symptom-specific information, and wherein faults are modeled in terms of the changes they effect on the statistical properties of network traffic. Further, what is needed is a system which is easily implemented.

SUMMARY OF THE INVENTION

[0013] The present invention provides an improved method and system for generation of temporally correlated alarms to detect network problems, based solely on the statistical properties of the network traffic. The system generates alarms independent of subjective criteria which are useful only in predicting specific network fault events. The system monitors abrupt changes in the normal traffic to provide potential indicators of faults. The present system overcomes the requirement of accurate models for normal traffic data and instead focuses on possible fault models.

[0014] The system provides a theoretical frame-work for the problem of network fault prediction through aggregate network traffic measurements in the form of the Management Information Base (MIB) variables. The statistical changes in the MIB variables that precede the occurrence of a fault are characterized and used to design an algorithm to achieve real-time prediction of network performance problems. A subset of the 171 MIB variables is first identified as relevant for prediction purposes. This step reduces the dimensionality and the complexity of the algorithm. The relevant MIB variables are processed to provide variable-level abnormality indicators (which indicate abrupt change points in the traffic measured by the variable). The algorithm accounts for the spatial relationships between the input MIB variables using a fusion center. The algorithm is successfully implemented on data obtained from two production networks that differ from each other significantly with respect to their size and their nature of traffic. The alarms obtained using the system are predictive with respect to the existing management schemes. The prediction time is sufficiently long to initiate potential recovery mechanisms for an automated network management system.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] The foregoing and other advantages and features of the invention will become more apparent from the detailed description of preferred embodiments of the invention given below with reference to the accompanying drawings in which:

[0016]FIG. 1 depicts a distributed processing scheme for a Wide Area Network;

[0017]FIG. 1a depicts the components of the intelligent agent processing of the present invention;

[0018]FIG. 2 depicts a typical raw MIB variable implemented as a counter;

[0019]FIG. 3 depicts a time series data obtained by differencing the MIB counter data;

[0020]FIG. 4 depicts Case Diagrams for the MIB variables at the if and the ip layers;

[0021]FIG. 5 depicts a key to understand the Case Diagram;

[0022]FIG. 6 depicts a use of Case Diagrams to capture relationships between MIB variables;

[0023]FIG. 7 depicts a simplified Case Diagram showing the 5 chosen MIB variables;

[0024]FIG. 8 depicts a time series data for ifInOctets at 15 sec polling;

[0025]FIG. 9 depicts a time series data for ifOutOctets at 15 sec polling;

[0026]FIG. 10 depicts a time series data for ipInReceives at 15 sec polling;

[0027]FIG. 11 depicts a time series data for ipInDelivers at 15 sec polling;

[0028]FIG. 12 depicts a time series data for ipOutRequests at 15 sec polling;

[0029]FIG. 13 depicts a scatter plot of ifInOctets and ifOutOctets showing high degree of scatter;

[0030]FIG. 14 depicts a scatter plot of ipInReceives and ipInDelivers showing very low correlation;

[0031]FIG. 15 depicts a scatter plot of ipInReceives and ipOutRequests showing very low correlation;

[0032]FIG. 16 depicts a scatter plot of ipInDelivers and ipOutRequests showing stronger correlation only at large increments;

[0033]FIG. 17 depicts a local distributed processing at the router;

[0034]FIG. 18 depicts a trace of ifIO before fault;

[0035]FIG. 19 depicts a trace of ifOO before fault;

[0036]FIG. 20 depicts a trace of ipIR before fault;

[0037]FIG. 21 depicts a trace of ipIDe before fault;

[0038]FIG. 22 depicts a trace of ipOR before fault;

[0039]FIG. 23 depicts correlated abrupt changes observed in the ip Level MIB Variables;

[0040]FIG. 24 depicts an auto-correlation of ifIO showing hyperbolic decay;

[0041]FIG. 25 depicts an auto-correlation of ifOO showing hyperbolic decay;

[0042]FIG. 26 depicts an auto-correlation of ipIR showing hyperbolic decay;

[0043]FIG. 27 depicts an auto-correlation of ipIDe showing hyperbolic decay;

[0044]FIG. 28 depicts an auto-correlation of ipOR showing exponential decay;

[0045]FIG. 29 depicts an agent processing;

[0046]FIG. 30 depicts an alarm declaration at the fusion center;

[0047]FIG. 31 depicts a trace of if and ip variables around fault period denoted by asterisks;

[0048]FIG. 32 depicts a trace of if and ip variables around fault period denoted by asterisks;

[0049]FIG. 33 depicts histograms of the differenced MIB data;

[0050]FIG. 34 depicts a scheme for online learning showing sequential positions of the learning and test windows;

[0051]FIG. 35 depicts contiguous piecewise stationary windows, L(t): Learning Window, S(t): Test Window;

[0052]FIG. 36 depicts an agent processing;

[0053]FIG. 37 depicts an auto-correlation of residuals of MIB data: ifIO, ifOO, ipIR, ipIDe, ipOR;

[0054]FIG. 38 depicts a Quantile—Quantile Plot of ifIO Residuals;

[0055]FIG. 39 depicts a Quantile—Quantile Plot of ifOO Residuals;

[0056]FIG. 40 depicts a Quantile—Quantile Plot of ipIR Residuals;

[0057]FIG. 41 depicts a Quantile—Quantile Plot of ipIDe Residuals;

[0058]FIG. 42 depicts a Quantile—Quantile Plot of ipOR Residuals;

[0059]FIG. 43 depicts a detection of abrupt changes in the ifIO variable at the sensor level;

[0060]FIG. 44 depicts a detection of abrupt changes in the ifOO variable at the sensor level;

[0061]FIG. 45 depicts a detection of abrupt changes in the ipIR variable at the sensor level;

[0062]FIG. 46 depicts a detection of abrupt changes in the ipIDe variable at the sensor level;

[0063]FIG. 47 depicts a detection of abrupt changes in the ipOR variable at the sensor level;

[0064]FIG. 48 depicts a Campus Network;

[0065]FIG. 49 depicts a Fusion Center to incorporate dependencies between variable-level indicators;

[0066]FIG. 50 depicts transitions of abrupt changes between MIB variables;

[0067]FIG. 51 depicts a fault vector and the problem domain for the ip agent;

[0068]FIG. 52 depicts average abnormality indicators for the ip layer;

[0069]FIG. 53 depicts fault vectors and problem domain for the if agent;

[0070]FIG. 54 depicts an average abnormality indicator for the if layer;

[0071]FIG. 55 depicts a persistence of abnormality;

[0072]FIG. 56 depicts a lack of persistence in normal situations;

[0073]FIG. 57 depicts an experimental network;

[0074]FIG. 58 depicts a summary of analytical results for CPU utilization;

[0075]FIG. 59 depicts a summary of experimental results for CPU utilization;

[0076]FIG. 60 depicts a CPU utilization;

[0077]FIG. 61 depicts a summary of results for theoretical values of network utilization;

[0078]FIG. 62 depicts a configuration of the monitored campus network;

[0079]FIG. 63 depicts a configuration of the monitored enterprise network;

[0080]FIG. 64 depicts an average abnormality at the router;

[0081]FIG. 65 depicts an abnormality indicator of ipIR;

[0082]FIG. 66 depicts an abnormality indicator of ipIDe;

[0083]FIG. 67 depicts an abnormality indicator of ipOR;

[0084]FIG. 68 depicts an abnormality at Subnet;

[0085]FIG. 69 depicts an abnormality of ifIO;

[0086]FIG. 70 depicts an abnormality of ifOO;

[0087]FIG. 71 depicts an average abnormality at the router;

[0088]FIG. 72 depicts an abnormality indicator of ipIR;

[0089]FIG. 73 depicts an abnormality indicator of ipIDe;

[0090]FIG. 74 depicts an abnormality indicator of ipOR;

[0091]FIG. 75 depicts an average abnormality at subnet;

[0092]FIG. 76 depicts an abnormality indicator of ifIO;

[0093]FIG. 77 depicts an abnormality indicator of ifOO;

[0094]FIG. 78 depicts an average abnormality at the router;

[0095]FIG. 79 depicts an abnormality indicator of ipIR;

[0096]FIG. 80 depicts an abnormality indicator of ipIDe;

[0097]FIG. 81 depicts an abnormality indicator of ipOR;

[0098]FIG. 82 depicts an average abnormality at subnet;

[0099]FIG. 83 depicts an abnormality indicator of ifIO;

[0100]FIG. 84 depicts an abnormality indicator of ifOO;

[0101]FIG. 85 depicts an average abnormality at the router;

[0102]FIG. 86 depicts an abnormality indicator of ipIR;

[0103]FIG. 87 depicts an abnormality indicator of ipIDe;

[0104]FIG. 88 depicts an abnormality indicator of ipOR;

[0105]FIG. 89 depicts an average abnormality at subnet;

[0106]FIG. 90 depicts an abnormality indicator of ifIO;

[0107]FIG. 91 depicts an abnormality indicator of ifOO;

[0108]FIG. 92 depicts quantities used in performance analysis;

[0109]FIG. 93 depicts the prediction and detection of file server failures at the internal router with τ=3;

[0110]FIG. 94 depicts the prediction and detection of file server failures at the interface of subnet 2 with the internal router with τ=3;

[0111]FIG. 95 depicts the prediction and detection of file server failures at the router with τ=3103;

[0112]FIG. 96 depicts the prediction and detection of file server failures at subnet 26, with τ=3104;

[0113]FIG. 97 depicts the prediction and detection of network access problems at the router with τ=3;

[0114]FIG. 98 depicts the prediction and detection of network access problems at subnet 26 with τ=3;

[0115]FIG. 99 depicts the prediction and detection of protocol implementation error at subnet 21 and router with τ=3;

[0116]FIG. 100 depicts the prediction and detection of a runaway process at subnet 26 and router with τ=3;

[0117]FIG. 101 depicts a flow chart for implementation of the algorithm; and

[0118]FIG. 102 depicts a classification of network faults.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0119] The present invention will be described in connection with exemplary embodiments illustrated in FIGS. 1-102. Other embodiments may be realized and other changes may be made to the disclosed embodiments without departing from the spirit or scope of the present invention.

[0120] System Level Design

[0121] A frame-work in which fault and performance problem detection can be performed is provided. The selection criteria used to determine the relevant management protocol and the variables useful for the prediction of traffic-related network faults are discussed. The implementation of the approach developed is also presented.

[0122] Frame-Work for Fault and Performance Problem Detection

[0123] A primary concern of real-time fault detection is scalability to multiple nodes 5. The scalability of the management scheme can be addressed by local processing at the nodes 5. Agents 3 are developed that are amenable to distributed implementation. The agents 3 use local information to generate temporally correlated alarms about abnormalities perceived at the different network nodes 5. For example, as shown in FIG. 1, a system 100 for a distributed processing scheme is provided. The information available at the router 1 is the aggregate of the information from all the subnets connected to that router 1. The router 1, which is a network-layer device, processes the ip layer information which is a multiplexing of traffic from all of the interfaces. Therefore, the output parameter of the agents implemented at the router provides the local view of network health. Thus, with local processing at the nodes, only processed information is passed on by each device as opposed to the raw data. The alarms obtained at these individual components can then be correlated by using standard alarm correlation techniques. The system provides an intelligent agent at the level of the network node.

[0124] Referring now to FIG. 1b, the components of the intelligent agent processing are described. The data processing unit 29 acquires MIB data 9. The change detector or sensor 33 produces a series of alarms 35 corresponding to change points observed in each individual MIB variable based upon processed data 31. These variable-level alarms 35 are candidate points for fault occurrences. In the fusion center 13, the variable-level alarms 35 are combined using a priori information about the relationships between these MIB variables 9. Time correlated alarms 37 corresponding to the anomalies are obtained as the output of the fusion center. These alarms 37 are indicative of the health of the network and help in the decisions made by the network components such as routers, thus making it possible to provide better QoS guarantees.

[0125] Since the intelligent agent uses statistical signal processing methods to obtain alarms, it is independent of the specific manifestation of the anomalies. This method therefore encompasses a larger subset of anomalies and is independent of the specific scenario that caused them.

[0126] Choice of Management Protocol

[0127] The network management discipline has several protocols in place which provide information about the traffic on the network. One of these protocols is selected as the data collection tool in order to study network traffic. The criterion used in the selection of the protocol is that the protocol support variables which correspond to traffic statistics at the device level. An exemplary management protocol is the Simple Network Management Protocol (SNMP).

[0128] Simple Network Management Protocol—SNMP

[0129] The SNMP works in a client-server paradigm. The SNMP manager is the client and the SNMP agent providing the data is the server. The protocol provides a mechanism to communicate between the manager and the agent. Very simple commands are used within SNMP to set, fetch, or reset values. A single SNMP manager can monitor hundreds of SNMP agents. SNMP is implemented at the application layer and runs over the User Datagram Protocol (UDP). The SNMP manager has the ability to collect management data that is provided by the SNMP agent, but does not have the ability to process this data. The SNMP server maintains a database of management variables called the Management Information Base (MIB) variables. The MIB variables are arranged in a tree structure following a structuring convention called the Structure of Management Information (SMI) and contain different variable types such as string, octet, and integer. These variables contain information pertaining to the different functions performed at the different layers by the different devices on the network. Every network device has a set of MIB variables that are specific to its functionality. The MIB variables are defined based on the type of device and also on the protocol level at which it operates. For example, bridges which are data link-layer devices contain variables that measure link-level traffic information. Routers which are network-layer devices contain variables that provide network-layer information. The advantage of using SNMP is that it is a widely deployed protocol and has been standardized for all different network devices. The MIB variables are easily accessible and provide traffic information at the different layers.

[0130] Choice of Management Variables

[0131] The SNMP protocol maintains a set of counters known as the Management Information Base (MIB) variables. A subset of these variables is chosen to aid in the detection of traffic-related faults. The variables were chosen based on their ability to capture the traffic flow into and out of the device. This process can be performed by a central processing unit.

[0132] Management Information Base Variables

[0133] The Management Information Base, which is maintained in the SNMP server, contains 171 variables. These variables fall into the following groups: System, Interfaces (if), Address Translation (at), Internet Protocol (ip), Internet Control Message Protocol (icmp), Transmission Control Protocol (tcp), User Datagram Protocol (udp), Exterior Gateway Protocol (egp), and Simple Network Management Protocol (snmp). Each group of variables describes the functionality of a specific protocol of the network device. Depending on the type of node monitored, an appropriate group of variables was considered. These variables are user defined. Here, the node being monitored is the router and therefore the if and the ip groups of variables are investigated. The if group of variables describes the traffic characteristics at a particular interface of the router and the ip variables describe the traffic characteristics at the network layer. The MIB variables are implemented as counters as shown in FIG. 2 (the counter resets at a value of 4294967295). The variables have to be further processed in order to obtain an indicator on the occurrence of network problems. Time series data for each MIB variable is obtained by differencing the MIB variables (the differenced data is illustrated in FIG. 3).
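The differencing step described above can be illustrated with a minimal Python sketch. It is not taken from the disclosed implementation: the function name `difference_counter` is hypothetical, and the sketch assumes that a reading smaller than its predecessor means the counter wrapped exactly once past 4294967295 between two polls.

```python
COUNTER_MAX = 4294967295  # wrap value stated in the text (2**32 - 1)


def difference_counter(samples, counter_max=COUNTER_MAX):
    """Turn successive raw counter readings into per-interval increments."""
    increments = []
    for previous, current in zip(samples, samples[1:]):
        delta = current - previous
        if delta < 0:
            # Assumption: a decrease means the counter wrapped exactly once.
            delta += counter_max + 1
        increments.append(delta)
    return increments


# Illustrative readings only; the second series crosses the wrap point.
print(difference_counter([100, 250, 400]))
print(difference_counter([4294967290, 4294967295, 4]))
```

A series of N raw readings yields N-1 increments, so the differenced time series is one sample shorter than the polled counter data.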

[0134] The relationships between the MIB variables of a particular protocol group can be represented using a Case Diagram. Case Diagrams are used to visualize the flow of management information in a protocol layer and thereby mark where the counters are incremented. The Case Diagram for the if and ip variables (FIG. 4) depicts the flow between the lower and upper network layers. A key to the understanding of the Case Diagram is shown in FIG. 5. An additive counter counts the number of traffic units that enter into a specific protocol layer and a subtractive counter counts the number of traffic units that leave the protocol layer. The variables that are depicted in the Case Diagram by a dotted line are called filter counters. A filter counter is a MIB variable that measures the level of traffic at the input and at the output of each layer.

[0135] In FIG. 4 variables such as ifInDiscards and ifOutDiscards are subtractive counters while variables such as ipFragCreates are additive counters. A simple example to illustrate the use of these diagrams is the number of ip datagrams that failed at reassembly (ipReasmFails) which is given by,

[0136] ipReasmFails=ipReasmReqds−ipReasmOks

[0137] This relationship is represented in the Case Diagram and emphasized in FIG. 6.
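As a small worked instance of the relation above, the following Python sketch computes the failure count from the two other counters. The function name `reassembly_failures` and the sample values are hypothetical and serve only to illustrate the arithmetic.

```python
def reassembly_failures(ip_reasm_reqds, ip_reasm_oks):
    """Apply ipReasmFails = ipReasmReqds - ipReasmOks to two counter values."""
    if ip_reasm_oks > ip_reasm_reqds:
        # Under the stated relation, successes cannot exceed requests.
        raise ValueError("ipReasmOks exceeds ipReasmReqds")
    return ip_reasm_reqds - ip_reasm_oks


# Illustrative values: 120 reassembly requests, 113 successful.
print(reassembly_failures(120, 113))
```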

[0138] Selection of a Relevant Set of MIB Variables

[0139] The choice of a set of MIB variables relevant to the detection of traffic-related problems helps reduce the computational complexity by reducing the dimensionality of the problem. This step can be user defined. Within a particular MIB group there exists some redundancy. Consider, for example, the variables interface Out Unicast packets (ifOU), interface Out Non Unicast packets (ifONU) and interface Out Octets (ifOO). The ifOO variable contains the same traffic information as that obtained using both ifOU and ifONU.

[0140] In order to simplify the problem, such redundant variables are not considered. Some of the variables, by virtue of their standard definition, are not relevant to the detection of traffic-related faults, e.g., ifIndex (which is the interface number) is excluded. MIB variables that show specific protocol implementation information, such as fragmentation and reassembly errors, are also not included. For example, the variable ifIE (which represents the number of errored bytes that arrived at a particular interface) is not considered. In current networks such errors are corrected by the protocols themselves using retransmission schemes. Fault situations of interest (i.e., faults which arise due to increased traffic, transient failure of network devices, and software related problems) may not be reflected in these error variables.

[0141] There is no single variable that is capable of capturing all network anomalies or all manifestations of the same network anomaly. Therefore, five MIB variables are selected. In the if layer, the variables ifIO (In Octets) and ifOO (Out Octets) are used to describe the characteristics of the traffic going into and out of that interface from the router. Similarly in the ip layer, three variables are used. The variable ipIR (In Receives) represents the total number of datagrams received from all interfaces of the router. The variable ipIDe (In Delivers) represents the number of datagrams correctly delivered to the higher layers as this node was their final destination. The variable ipOR (Out Requests) represents the number of datagrams passed on from the higher layers of the node to be forwarded by the ip layer. The ip variables sufficiently describe the functionality of the router. The ip layer variables help to isolate the problem to the finer granularity of the subnet level. The chosen variables are depicted in FIG. 7 by a dotted line. These variables are not redundant and represent cross sections of the traffic at different points in the protocol stack. They correspond to the filter counters in FIG. 4. A typical trace of each of these variables over a two hour period is shown in FIGS. 8 through 12. The if variables are obtained in terms of bytes or octets. These variables correspond to the traffic that goes into and out of an interface and therefore show bursty behavior. The traffic is measured by the sensor 33 of FIG. 1b. The ip level variables are obtained as datagrams. The ipIR variable measures the traffic that enters the network layer at a particular router and therefore shows bursty behavior. The ipIDe and ipOR variables are less bursty since they correspond to traffic that leaves or enters the network layer to or from the transport layer of the router. The traffic associated with these variables comprises only a fraction of the entire network traffic. However, in the case of fault detection these are relevant variables since the router does some processing of the routing tables in fault instances in order to update the routing metrics.

[0142] The five MIB variables chosen are not strictly independent. However, the relationships between these variables are not obvious. These relationships depend on parameters of the traffic such as source and destination of the packet, processing speed of the device, and the actual implementation of the protocol. The extent of relationships between the chosen variables is shown with the help of scatter plots in FIGS. 13 to 16. In FIG. 13 although the increments in the ifIO and the ifOO counters show some correlation, these correlations are very small as seen from the high degree of scatter. The average cross correlation between these two variables is 0.01. In FIGS. 14 and 15 the variables ipIDe and ipOR have no obvious relationship with ipIR. The average correlation of ipIR with ipIDe is 0.08 and with ipOR is 0.05. In FIG. 16 there is some significant correlation in the ipOR and ipIDe variables at large increments. The average cross correlation between ipOR and ipIDe is 0.32. The cross correlations are computed using normal data over a period of 4 hours.
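One way such a cross correlation between two differenced MIB series might be computed is sketched below in Python. The text does not state how the averaging is performed, so this sketch assumes the zero-lag Pearson correlation coefficient over one window of increments; the function name `cross_correlation` and the sample series are hypothetical.

```python
from math import sqrt


def cross_correlation(x, y):
    """Zero-lag Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: treat a constant series as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Illustrative increments only, not measured data.
ip_or = [12, 15, 11, 40, 13]
ip_ide = [10, 14, 12, 35, 11]
print(cross_correlation(ip_or, ip_ide))
```

Under this assumption, an averaged figure could be formed by evaluating the coefficient on successive windows of the differenced data and taking the mean of the results.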

[0143] One of the limitations in the choice of the specific MIB variables is that the isolation and diagnosis of the problem is restricted to the subnet level. Further isolation to the application level will require that additional MIB variables be included.

[0144] The Intelligent Agent and Implementation Scheme

[0145] Here, intelligent agents have been designed to perform the task of detecting network faults and performance degradations in real time. Intelligent agents are software entities that process the raw MIB data obtained from the devices to provide a real-time indicator of network health. These agents can be deployed in a distributed fashion across the different network nodes.

[0146] The agent 3 processing at each node 5 is separated into smaller units dealing with each specific protocol layer. In the case of the router 1, the interface (if) layer information and the network (ip) layer information are processed independently (see FIG. 17, 3a, 3b). This separation of tasks allows the agent 3 to scale easily for any number of interfaces that a router 1 may have. The interface layer processing or the if agent yields an indicator that measures the health of the specific subnet connected to a particular interface of the router 1. However, the if agent 3b alarms would be unable to detect problems at another interface port. Using all the if variables at a router 1, the intelligent agent should be able to detect network problems that occur in all the subnets 7. The processing at the network layer or the ip agent provides an indicator for the network health as perceived by the router. However, without the ip variables, problems at the router 1 would not get detected promptly, and the propagation of the fault through the network would not be observed. Therefore using the distributed scheme shown in FIG. 17, a problem at a router 1 can be further isolated to the subnet 7 level.

[0147] Proposed Model for Network Faults

[0148] Faults refer to circumstances where correction is beyond the normal functional range of network protocols and devices. Faults affect network availability immediately or indicate an impending adverse effect. Network faults and performance problems can be broadly classified as either predictable or non-predictable faults. Predictable faults are preceded by indications that allow inference of an impending fault. The opposite is true in the case of non-predictable faults. Non-predictable faults correspond to events in which these adverse effects occur simultaneously with their indications.

[0149] Predictable and Non-Predictable Faults

[0150] Examples of predictable faults are: file server failures, paging across the network, broadcast storms and a babbling node. These faults affect the normal traffic load patterns in the network. For example, in the case of file server failures such as a web server, it is observed that prior to the fault event there is an increase in the number of ftp requests to that server. Network paging occurs when an application program outgrows the memory limitations of the work station and begins paging to a network file server. This may not affect the individual user but affects others on the network by causing a shortage of network bandwidth. Broadcast storms refer to situations where broadcasts are heavily used to the point of disabling the network by causing unnecessary traffic. A babbling node is a situation where a node sends out small packets in an infinite loop in order to check for some information such as status reports. This fault only manifests itself when the average network utilization is low since it has a negligible contribution to heavy traffic volumes. Congestion at short time scales is an example of a performance problem that can be predicted by closely monitoring the network traffic characteristics. Here, predictability is defined with respect to any existing indications such as syslog messages. The primary cause for predictable faults can be either hardware (such as a faulty interface card) or software related.

[0151] An example of a non-predictable fault is a link break, i.e., when a functioning link has been accidentally disconnected. Such faults cannot be predicted. On the other hand, non-predictable faults such as protocol implementation errors can result in increased traffic load characteristics, thus allowing for detection. For example, the presence of an accept protocol error in a super server (inetd) results in reduced access to the network, which in turn affects network traffic loads. The symptom thus observed in the traffic loads can then be detected as an indication of a fault.

[0152] Here, both predictable and non-predictable faults that are traffic related are examined. It is possible to identify traffic-related faults by the effect they cause in normal network behavior. The definition of normal network behavior is dependent on the dynamics involved in the network in terms of the traffic volume, the type of applications running on the network, etc. Since network traffic exhibits fractal behavior, there are no analytically simple models that can be used to learn the normal behavior. To circumvent the problem of accurate traffic models, the present system models network fault behavior as opposed to normal behavior.

[0153] Deviations from normal network behavior that occur before or during fault events can be associated with transient signals caused by the performance degradation. Therefore, it is premised that faults can be identified by transient signals that are produced by a performance degradation prior to or during a full blown failure.

[0154] Experimental Study of the Structure of Network Faults Using MIB Variables

[0155] In general, network traffic can be measured in terms of the network load such as packet transmission rate. However, to obtain a finer resolution at the different nodes on the network it is beneficial to use the traffic-related Management Information Base (MIB) variables. To better define network faults, a specific fault manifestation is discussed. This particular fault occurred on a campus LAN network and corresponded to a file server failure that was reported by 36 machines of which 12 were located on the same subnet as the file server. The fault lasted for a duration of seven minutes. FIGS. 18 through 22 show the trace of the different traffic-related MIB variables at the ip layer, 2 hours before the fault was observed by the existing mechanisms such as syslog messages. The fault was observed (by detecting changes in the statistics of the traffic data) in the syslog messages generated by the machines experiencing faulty conditions. This particular fault is a good illustrative case as the deviations from normal network behavior are more easily observable in the traffic traces. The extent of deviation from normal behavior is different for different variables and also varies based on the manifestation of the fault. In the case discussed there is a significant change in the mean level of traffic observed in the ifOO variable as compared to the ifIO variable. The situation observed in the ifOO variable is one extreme case. In the ip level variables the changes observed in the ipIDe and ipOR variables are much more subtle than the changes in the ipIR variable. Therefore, more sophisticated methods are required to detect these subtle changes. The detection results obtained in the case of the ip variables are shown in FIG. 23.

[0156] Another important aspect to be noted is that the subtle abrupt changes associated with the fault events occur in a correlated fashion across the different MIB variables of a particular protocol layer. Note in FIGS. 20 through 22 that there are abrupt changes observed in all the three ip level variables less than one half hour before the fault occurred. Results showing correlated abrupt changes for this specific fault under discussion are shown in FIG. 23. The Y axis represents the magnitude of the abrupt changes. Note that abrupt changes are detected in all of these MIB variables prior to the fault. This is found to be true in the case of the if level variables as well.

[0157] Non-Stationarity in MIB Data

[0158] It is found that some of the MIB variables are non-stationary. Since the non-stationary (long-range dependent) variables do not have accurate models, a more sophisticated method of distinguishing the deviations from normal network behavior is required. Adaptive learning methods are used to address the problem of non-stationarity.

[0159] An accurate estimation of the Hurst Parameter for the MIB variables is difficult due to the lack of high resolution data. Therefore, the long-range dependent behavior of the MIB variables is observed in terms of the autocorrelation functions (see FIGS. 24-28). For the ifIO, ifOO, and ipIR variables (see FIGS. 24, 25, and 26) the autocorrelation is significantly high even at very large lags. At 50 lags (12.5 mins) the ifIO variable has an autocorrelation value of 0.3, the ifOO variable has an autocorrelation value of 0.81, and the ipIR variable has an autocorrelation value of 0.6. There is a slow decay in the autocorrelation function, thus giving rise to a hyperbolic rather than an exponential decay. This observation is indicative of long-range dependence. In FIGS. 27 and 28 the autocorrelation for the variables ipIDe and ipOR decays exponentially, showing that these variables are not fractal in nature. The variables ifIO, ifOO, and ipIR relate to actual traffic traces and have long-range dependence. Thus, in the case of the ifIO, ifOO and ipIR variables the normal MIB data are long-range dependent. For the variables ipIDe and ipOR the normal MIB data are short-range dependent.

[0160] Proposed Model of Network Faults

[0161] It is proposed that faults can be modeled as correlated transient (short-range dependent) signals that are embedded in background MIB data. The transient signals manifest themselves as abrupt changes. An abrupt change is any change in the parameters of a signal that occurs on the order of the sampling period of the measurement of the signal. Here, the sampling period was 15 seconds. Therefore, an abrupt change is defined as a change that occurs in the period of approximately 15 seconds. The transient changes can be expressed mathematically using the average autocorrelation. In the case of a purely long-range dependent process we have that the autocorrelation r(k) satisfies the property, $\sum_{k} r(k) = \infty$

[0162] where r(k)˜k^(2H−2) as k→∞. Here k is the number of lags and H, which satisfies H>0.5, is the Hurst Parameter. This results in the hyperbolic curve of the correlogram as seen in FIGS. 24 through 26. However, in the case of transient signals that cause the correlogram to decay exponentially we have, $0 < \sum_{k} r(k) < \infty$

[0163] where r(k)˜ρ^(k) as k→∞ and the correlation coefficient ρ satisfies |ρ|≦1.
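The contrast between the two summability conditions can be checked numerically. The sketch below is only an illustration: it evaluates partial sums of the two asymptotic forms r(k)=k^(2H−2) and r(k)=ρ^(k), with hypothetical helper names and example values H=0.8 and ρ=0.5 that are not taken from the measured data.

```python
def lrd_acf(hurst):
    """Asymptotic long-range dependent autocorrelation r(k) = k^(2H - 2)."""
    return lambda k: k ** (2 * hurst - 2)

def srd_acf(rho):
    """Exponentially decaying (short-range dependent) autocorrelation r(k) = rho^k."""
    return lambda k: rho ** k

def partial_sum(acf, n_lags):
    """Sum of acf(k) for k = 1 .. n_lags."""
    return sum(acf(k) for k in range(1, n_lags + 1))

# The hyperbolic form keeps growing as more lags are added,
# while the exponential form settles to a finite limit.
growth_lrd = partial_sum(lrd_acf(0.8), 10000) - partial_sum(lrd_acf(0.8), 1000)
growth_srd = partial_sum(srd_acf(0.5), 10000) - partial_sum(srd_acf(0.5), 1000)
```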

[0164] The abrupt changes can be modeled using an Auto-Regressive (AR) process. Since these abrupt changes propagate through the network, they can be traced as correlated events among the different MIB variables. This correlation property distinguishes abrupt changes intrinsic to fault situations from those random changes of the system which are related to the network's normal function. In conclusion, traffic-related faults of interest can be defined by their effect on network traffic such that before or during a fault occurrence, traffic-related MIB variables undergo abrupt changes in a correlated fashion.

[0165] Problem Statement and Algorithm

[0166] Using the above model for network faults, the fault detection problem can be posed such that given a sequence of traffic-related MIB variables 9 sampled at a fixed interval, a network health function can be generated that can be used to declare alarms corresponding to network fault events. The fault model is used to develop a detection scheme to declare an alarm at some time t_(a) which corresponds to an impending fault situation or an actual fault event. The steps involved are described below and depicted pictorially in FIG. 29.

[0167] Step (1): The statistical distributions of the individual MIB variables 9 are significantly different, thus making it difficult to do joint processing of these variables 9. Therefore, sensors 11 are assigned individually for each MIB variable 9. The abrupt changes in the characteristics of the MIB variables 9 are captured by these sensors 11. The sensors 11 perform a hypothesis test based on the Generalized Likelihood Ratio (GLR) test and provide an abnormality indicator that is scaled between 0 and 1. The abnormality indicators are collected to form the abnormality vector {right arrow over (ψ)}(t). The abnormality vector {right arrow over (ψ)}(t) is a measure of the abrupt changes in normal network behavior. This measure is obtained in a time-correlated fashion.

[0168] Step (2): The fusion center 13 incorporates the spatial dependencies between the abrupt changes in the individual MIB variables 9 into the abnormality vector by using a linear operator A. In particular the quadratic functional:

ƒ({right arrow over (ψ)}(t))={right arrow over (ψ)}(t)A{right arrow over (ψ)}(t),

[0169] is used to generate a continuous scalar indicator 15 of network health. This network health indicator 15 is interpreted as a measure of abnormality in the network as perceived by the specific node. The network health indicator 15 is bounded between 0 and 1 by a transformation of the operator A. A value of 0 represents a healthy network and a value of 1 represents maximum abnormality in the network.

[0170] Step (3): The operator matrix A is an M×M matrix (M is the number of sensors). In order to ensure orthogonal eigenvectors which form a basis for R^(M) and real eigenvalues, the matrix A is designed to be symmetric. Thus it will have M orthogonal eigenvectors with M real eigenvalues. A subset of these eigenvectors is identified that corresponds to fault states in the network. Let λ_(fmin) and λ_(fmax) be the minimum and maximum eigenvalues that correspond to these fault states. The problem of alarm generation by the agent 3 can then be expressed as:

t_(a)=inf{t: λ_(fmin)≦ƒ({right arrow over (ψ)}(t))≦λ_(fmax)}

[0171] where t_(a) is the earliest time at which the functional ƒ({right arrow over (ψ)}(t)) exceeds λ_(fmin) (see FIG. 3.13). Each time the condition is satisfied, there is a potential alarm. In order to declare alarms that correspond to a fault situation, persistence criteria are further imposed on the potential alarm conditions.
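Steps (2) and (3) can be sketched in a few lines. The following is a minimal illustration, not the patented implementation: the quadratic functional is evaluated for a row vector and a square operator matrix, and a potential alarm is promoted to a declared alarm only after the functional has stayed inside the fault band for a number of consecutive samples. The persistence rule, its default of 3 samples, and all function names are assumptions, since the text does not specify the persistence criteria.

```python
def health(psi, A):
    """Quadratic functional psi A psi^T for a row vector psi and a square matrix A."""
    m = len(psi)
    return sum(psi[i] * A[i][j] * psi[j] for i in range(m) for j in range(m))

def first_alarm(health_series, lam_min, lam_max, persistence=3):
    """Index of the sample that completes `persistence` consecutive in-band
    values of the health functional, or None if that never happens.

    The consecutive-sample rule is a hypothetical persistence criterion.
    """
    run = 0
    for t, value in enumerate(health_series):
        run = run + 1 if lam_min <= value <= lam_max else 0
        if run >= persistence:
            return t
    return None
```

For example, `first_alarm([0.1, 0.7, 0.2, 0.8, 0.9, 0.85, 0.3], 0.6, 1.0)` ignores the isolated excursion at index 1 and reports index 5, where the third consecutive in-band value occurs.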

[0172] Detection of Abrupt Changes in Management Information Base Variables

[0173] It has been experimentally shown that changes in the statistics of traffic data can in general be used to detect faults. According to the present fault model, network faults manifest themselves as abrupt changes in the traffic-related MIB variables. Since the MIB variables have different statistical distributions, some of which are non-Gaussian, joint processing is not possible. Hence, for each individual MIB variable a sensor is designed to detect the abrupt changes. Since the MIB variables are not strictly independent, they have non-zero cross correlations. These correlations are time varying and are accounted for when the variable level sensor outputs are combined at the fusion center. This method of incorporating the correlations is an advantage in terms of reducing the complexity of the algorithm.

[0174] Faults produce abrupt changes in network traffic that require more sophisticated methods than second-order statistics in order to be detected. FIGS. 31 and 32 illustrate the behavior of the MIB variables around the fault region in two different cases. The column of asterisks and dots in the figures indicates when a network fault occurred. Note that there does not seem to be a drastic change in the overall behavior (1 hour) of the data trace before a fault occurs. In FIG. 31, the periodicities inherent to the network traffic dominate the trace since the mean traffic level was low during the early hours (2 am) of the day when this particular fault occurred.

[0175] Change Detection

[0176] In most problems with multiple input variables a simple multivariate hypothesis test is employed to perform detection using parametric procedures. However, multivariate hypothesis testing requires knowledge of the joint statistics of the input variables as well as some assumptions of stationarity. Since the MIB variables are highly non-stationary and there is no prior information available about the statistics of the normal traffic or of the alternate fault hypothesis, multivariate hypothesis testing is not suitable here. The histogram of the differenced time series corresponding to each MIB variable is presented in FIG. 33. The histogram of the data is shown to provide a sense of the distribution of these variables.

[0177] Online Learning/Detection

[0178] The time series data obtained from the MIB variables are non-stationary, thus an adaptive learning algorithm to account for the normal drifts in the traffic is required. Hypothesis testing is performed by comparing two adjacent non-overlapping windows of the time series, the learning window L(t) and the test window S(t). The length of these windows is chosen so that the time series data within these windows could be considered piecewise stationary. As time increments, these windows slide across the time series as depicted in FIG. 34.
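The windowing scheme can be illustrated with a short generator. This is a sketch under stated assumptions: the function name, the `step` parameter, and the convention of labeling each pair by the index of the last test sample are all hypothetical choices, not details given in the text.

```python
def sliding_windows(series, n_learn, n_test, step=1):
    """Yield (t, learning_window, test_window) for adjacent, non-overlapping
    windows that slide across the series; t is the index of the last test sample."""
    total = n_learn + n_test
    for start in range(0, len(series) - total + 1, step):
        learn = series[start:start + n_learn]
        test = series[start + n_learn:start + total]
        yield start + total - 1, learn, test
```

Each yielded pair would then be handed to the hypothesis test described in the paragraphs that follow.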

[0179] Hypothesis Testing using Generalized Likelihood Ratio

[0180] A sequential hypothesis test is performed to determine whether a change has occurred going from the learning window to the test window. Since faults are manifested as abrupt changes, the piecewise stationary segments of the data (learning and test windows) are modeled using an AR process of order p. The hypothesis test based on the power of the residual signals in the segments is performed to determine if a change has occurred.

[0181] Consider a learning window L(t) and test window S(t) of lengths N_(L) and N_(S) respectively as in FIG. 35. First, consider the learning window L(t):

L(t)={l₁(t), l₂(t), . . . , l_(N_(L))(t)}

[0182] We can express any l_(i)(t) as {overscore (l)}_(i)(t) where {overscore (l)}_(i)(t)=l_(i)(t)−μ and μ is the mean of the segment L(t). Now {overscore (l)}_(i)(t) is modeled as an AR order p process with a residual error ε_(i), $\varepsilon_{i}(t) = \sum_{k=0}^{p} \alpha_{k}\,\bar{l}_{i}(t-k),$

[0183] where α_(L)={α₁, α₂, . . . , α_(p)} and α₀=1 are the AR parameters.

[0184] Assuming that each residual time sample is drawn from an N(0, σ_(L) ²) distribution, the joint likelihood of the residual time series is obtained as $p\left(\varepsilon_{p+1}, \ldots, \varepsilon_{N_{L}} \mid \alpha_{1}, \ldots, \alpha_{p}\right) = \left(\frac{1}{\sqrt{2\pi\sigma_{L}^{2}}}\right)^{N'_{L}} \exp\left(\frac{-N'_{L}\,\hat{\sigma}_{L}^{2}}{2\sigma_{L}^{2}}\right),$

[0185] where σ_(L) ² is the variance of the segment L(t), N′_(L)=N_(L)−p, and {circumflex over (σ)}_(L) ² is the covariance estimate of σ_(L) ².

[0186] A similar expression can be obtained for the test window segment S(t). Now the joint likelihood ν of the two segments L(t) and S(t) is given as, $\nu = \left(\frac{1}{\sqrt{2\pi\sigma_{L}^{2}}}\right)^{N'_{L}} \left(\frac{1}{\sqrt{2\pi\sigma_{S}^{2}}}\right)^{N'_{S}} \exp\left(\frac{-N'_{L}\,\hat{\sigma}_{L}^{2}}{2\sigma_{L}^{2}}\right) \exp\left(\frac{-N'_{S}\,\hat{\sigma}_{S}^{2}}{2\sigma_{S}^{2}}\right)$

[0187] where σ_(S) ² is the variance of the segment S(t), N′_(S)=N_(S)−p, and {circumflex over (σ)}_(S) ² is the covariance estimate of σ_(S) ². The expression for ν is a sufficient statistic and is used to perform a binary hypothesis test based on the Generalized Likelihood Ratio. The two hypotheses are H₀, implying that no change is observed between the learning and the test segments, and H₁, implying that a change is observed. Under the hypothesis H₀ we have,

α_(L)=α_(S),

σ_(L) ²=σ_(S) ²=σ_(P) ².

[0188] where σ_(P) ² is the pooled variance of the combined learning and test segments. Therefore under hypothesis H₀ the likelihood ν₀ becomes, $\nu_{0} = \left(\frac{1}{\sqrt{2\pi\sigma_{P}^{2}}}\right)^{N'_{L} + N'_{S}} \exp\left(\frac{-\left(N'_{L} + N'_{S}\right)\hat{\sigma}_{P}^{2}}{2\sigma_{P}^{2}}\right)$

[0189] Under hypothesis H₁ we have,

α_(L)≠α_(S),

σ_(L) ²≠σ_(S) ².

[0190] implying that a change is observed between the two windows. Hence the likelihood ν₁ under H₁ becomes,

ν₁=ν

[0191] In order to obtain a value for the generalized likelihood ratio η that is bounded between 0 and 1, we define η as follows, $\eta = \frac{\nu_{1}}{\nu_{1} + \nu_{0}}$

[0192] Furthermore, on using the maximum likelihood estimates for the variance terms we get: $\eta = \frac{\hat{\sigma}_{L}^{-N'_{L}}\,\hat{\sigma}_{S}^{-N'_{S}}}{\hat{\sigma}_{L}^{-N'_{L}}\,\hat{\sigma}_{S}^{-N'_{S}} + \hat{\sigma}_{P}^{-\left(N'_{L} + N'_{S}\right)}}$
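A minimal sketch of a sensor that evaluates this bounded ratio is given below. It rests on several assumptions that are not stated in the text: the AR order is fixed at p=1, the AR coefficient is fitted by ordinary least squares, the pooled variance is taken as the residual variance of a single AR(1) fit to the concatenated windows, and a small floor is placed on the variances to avoid taking the logarithm of zero. The ratio is evaluated in the log domain purely for numerical stability; all function names are hypothetical.

```python
import math

def ar1_residual_variance(segment):
    """Demean a segment, fit AR(1) by least squares, and return
    (residual variance, number of residual samples N - 1)."""
    n = len(segment)
    mean = sum(segment) / n
    x = [v - mean for v in segment]
    num = sum(x[i] * x[i - 1] for i in range(1, n))
    den = sum(x[i - 1] ** 2 for i in range(1, n))
    alpha1 = -num / den if den > 0 else 0.0
    # Residual with alpha_0 = 1: e_i = x_i + alpha_1 * x_{i-1}
    resid = [x[i] + alpha1 * x[i - 1] for i in range(1, n)]
    var = sum(e * e for e in resid) / (n - 1)
    return max(var, 1e-12), n - 1  # variance floor is an assumed safeguard

def glr_indicator(learn, test):
    """Abnormality indicator eta in [0, 1] for one learning/test window pair."""
    var_l, n_l = ar1_residual_variance(learn)
    var_s, n_s = ar1_residual_variance(test)
    var_p, _ = ar1_residual_variance(list(learn) + list(test))
    # d = log(nu_0) - log(nu_1), using sigma^(-N) = exp(-0.5 * N * log(variance))
    d = 0.5 * (n_l * math.log(var_l) + n_s * math.log(var_s)
               - (n_l + n_s) * math.log(var_p))
    if d > 700:  # exp would overflow; eta is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(d))
```

With this sketch, two windows drawn from the same repeating pattern give a value near one half, while a test window whose mean has jumped gives a value close to 1.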

[0193] Using this approach, a measure of the likelihood of abnormality for each of the MIB variables 9 as the output of the individual sensors 11 is obtained. These indicators 15, which are functions of system time, are updated every N_(S) lags. The indicators 15 provided by the sensors 11 form the abnormality vector which is fed into the fusion center 13 as shown in FIG. 36. The abnormality vector {right arrow over (ψ)}(t) is composed of elements ψ_(i)(t) where,

ψ_(i)(t)=η

[0194] for the ith MIB variable.

[0195] Study of Residuals

[0196] Network traffic has been shown to exhibit long-range dependence. Therefore, it is necessary to explore the time lagged properties of the residuals of the piecewise stationary segments obtained from the traffic-related MIB data. The correlation function of a typical residual signal obtained from the different MIB variables is shown in FIG. 37. The correlogram is obtained over 50 time lags (approx 12.5 mins). Each time lag corresponds to 15 seconds. Note that there is no significant correlation after 10 lags (2.5 mins).

[0197] The quantile distributions of the residuals of the MIB variables are plotted against the quantiles of a standard normal distribution in FIGS. 38 through 42. When there is a noticeable ‘S’ shape in the quantile-quantile plot the residuals slightly differ from a standard normal distribution in that the former have a longer tail. Therefore as seen from the figures, the if variables can be better approximated as Gaussian random variables than the if variables. However, since only the first two moments of the residual time series are of concern, the Gaussian approximation for the residual error distribution of all the variables is utilized.

[0198] Implementation

[0199] The implementation of the change detection algorithm depends on the choice of the window size N_(L) for the learning window and N_(S) for the test window as well as p, the order of the AR process. A higher order of the AR process will model the data in the window more accurately but will require a large window size, because a minimum number of samples is necessary to estimate the AR parameters accurately. An increase in window size will result in a delay in the prediction of an impending fault. Subject to these constraints, we choose the test window size N_(S)=20 samples (5 min). The length of the learning window N_(L) is experimentally optimized for the different MIB variables. The ipIR, ifIO, and ifOO variables require a learning window N_(L) of 20 samples (5 mins at 15 sec polling). In the case of the campus network the variables ipIDe and ipOR have an optimal learning window N_(L) of 480 samples (120 mins at 15 sec polling). In the case of the enterprise network it was found that the variables ipIDe and ipOR were more bursty and therefore N_(L) was reduced to 120 samples (30 mins at 15 sec polling). The system implies that when the learning window is increased beyond the optimal window size, no changes are detected. The difference in the learning window sizes for the different MIB variables can be attributed to the bursty behavior of the first set of variables.

[0200] Adequate representation of the signal and parsimonious modeling are competing requirements. Hence, a trade off between these two issues is necessary. The accuracy of the model is measured in terms of Akaike's Final Prediction Error (FPE) criterion. The order corresponding to a minimum prediction error is the one that best models the signal. However, due to singularity issues there is a constraint on the order p, expressed as:

0≦p≦0.1N

[0201] where N is the length of the sample window. In order to compare the residuals from the learning and the test windows, it is necessary to use the same AR order to model the data in both these windows. Hence the value of N is constrained by the length of the test window N_(S)=20 samples. The appropriate order for p is chosen to be 1 since it minimizes the FPE subject to the constraints of the problem.
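The order selection can be illustrated as follows. The sketch assumes one common form of Akaike's criterion, FPE(p) = σ̂_(p) ² (N+p+1)/(N−p−1), which is not spelled out in the text, and it takes the residual variance for each candidate order as an input rather than fitting the models. The function names and the example variances are hypothetical.

```python
def max_ar_order(n_samples):
    """Largest order allowed by the constraint 0 <= p <= 0.1 N."""
    return n_samples // 10

def fpe(residual_variance, n_samples, order):
    """Final Prediction Error in the assumed form var * (N + p + 1) / (N - p - 1)."""
    return residual_variance * (n_samples + order + 1) / (n_samples - order - 1)

def select_order(variance_by_order, n_samples):
    """Pick the admissible order with the smallest FPE.

    variance_by_order maps a candidate order p to its residual variance.
    """
    allowed = [p for p in variance_by_order if 0 <= p <= max_ar_order(n_samples)]
    return min(allowed, key=lambda p: fpe(variance_by_order[p], n_samples, p))
```

For a 20-sample test window the constraint admits orders 0, 1 and 2. With illustrative residual variances of 4.0, 2.0 and 1.95 for those orders, `select_order` returns 1, because the small reduction in variance at order 2 does not offset the larger penalty factor.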

[0202] Results

[0203] Examples of the change detection algorithm applied to the five MIB variables in one typical fault case are shown in FIGS. 43 through 47. The MIB variable data is plotted alongside the output abnormality indicators. The trace corresponds to a 4 hour period. The fault region is denoted using asterisks. The abnormality indicators in general rise prior to the fault event. However, there are times when the abnormality indicator for a single variable rises high in the absence of a fault. These situations contribute to some of the false alarms generated by the agent. Note that there is a relatively higher number of such alarms in the variables ifIO, ifOO, and ipIR. It is proposed that this is due to the bursty nature of these variables and the inability of the single time scale algorithm to learn the normal behavior accurately.

[0204] The results of the change detection algorithm are summarized in FIG. 48. From FIG. 48, it is concluded that the ipOR variable is a good indicator of network anomalies since changes corresponding to all the faults were detected in the indicator for this variable. Furthermore, in accordance with the proposed fault model, the abrupt changes associated with a network fault can be distinguished only if the changes occur in a correlated fashion among the different MIB variables. Under normal conditions the abrupt changes are less correlated between the different MIB variables. Therefore all the five variables are needed to predict network faults. Furthermore, using more than one variable will help reduce the occurrence of false alarms. This motivated the need to combine the information obtained from the individual sensors (associated with the different MIB variables) at the fusion center.

[0205] Combination of Sensor Information: Fusion Center

[0206] Although alarms obtained at the sensors for each variable can indicate some problematic behavior, they contain only partial and noisy information about a potential network problem. Therefore to reduce the false alarms generated at the variable level, it is necessary to combine the information from the sensors. Even though the MIB variables are dependent, the sensor outputs are obtained by treating the MIB variables independently. Therefore the outputs of the sensors need to be combined to take into account these dependencies.

[0207] In accordance with the present model for network faults, a method for identifying correlated changes in the MIB variables 9 must be developed. This task is accomplished using a fusion center 13. The fusion center 13 is used to incorporate these spatial dependencies into the time-correlated variable-level abnormality indicators 15. The output of the fusion center 13 is a single continuous scalar indicator 15 of network level abnormality as perceived by the node level agent (see FIG. 49). The system employs two different methods at the fusion center 13: a duration filter approach and an approach using a linear operator. The linear operator method is found to be more amenable to online implementation and is able to combine the variable-level information in a more straightforward manner than the duration filter.

[0208] Duration Filter

[0209] In the combination scheme, the sensor level output is combined using a duration filter. The duration filter is implemented on the premise that a change observed in a particular variable should propagate into another variable that is higher up in the protocol stack. For example, in the case of the ifIO variable, the flow of traffic is towards the ipIR variable and therefore an abrupt change in the ifIO variable should propagate to the ipIR variable. Using the relationships from the Case diagram representation shown in FIG. 4, all possible transitions between the chosen variables are determined (see FIG. 50). The duration filter is designed to detect all four transition types. The time interval between transitions represents the duration filter. The length of the duration filter for each transition is experimentally determined. Transitions that occur within the same protocol layer (ipIR to ipIDe) require a duration filter of length 15 seconds, which is the sampling period of the MIBs. However, for transitions that occur between the if and the ip layers, a significantly longer duration filter of 20 to 30 min is required. The duration filter generates a single alarm that corresponds to both the interface (if) and the network (ip) layer. Hence, no new scheme is required to combine the information obtained from the different protocol layers to provide a single node level alarm. However, the disadvantage is that the estimation of the values of the transition times between the different variables is difficult, especially in the case of transitions between protocol layers. This resulted in the use of larger values for duration filter sizes to ensure the detection of different faults, which generated more false alarms. Furthermore, the alarms generated by the agent are of binary nature (0 or 1), thus obscuring the trends in abnormality. Trends are essential in order to provide a confidence measure to the declared alarms before potential recovery schemes are deployed.
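The core check performed by a duration filter can be sketched as a simple time-window match between change events in two variables. The representation of sensor output as lists of change times in seconds, and the function name, are assumptions made for illustration only.

```python
def transitions(src_times, dst_times, max_delay):
    """Pairs (s, d) where a change at time d in the downstream variable
    follows a change at time s in the upstream variable within max_delay seconds."""
    return [(s, d) for s in src_times for d in dst_times if 0 <= d - s <= max_delay]

# Within one protocol layer the filter length is one sampling period (15 s);
# between the if and ip layers a much longer filter (here 30 min) is used.
same_layer = transitions([0, 100], [15, 400], 15)
cross_layer = transitions([0, 100], [15, 400], 30 * 60)
```

The longer cross-layer window accepts many more pairings, which is consistent with the observation above that larger duration filter sizes generated more false alarms.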

[0210] The Linear Operator A and the Quadratic Functional ƒ({right arrow over (ψ)}(t))

[0211] We hypothesize that the spatial dependencies in the abnormality vector {right arrow over (ψ)}(t) can be captured using a linear operator A at the fusion center. In analogy to quantum mechanics, the observable of this operator is interpreted as the abnormality indicator and the expectation of the observable is the scalar quantity λ used to indicate the average abnormality of the network as perceived by the agent.

[0212] Analogy of Quantum Mechanics

[0213] In quantum mechanics, measurable quantities are described by an operator A acting on a vector in a state space. The measurable quantity is also referred to as an observable. An example of an operator is the Hamiltonian H, which operates on a vector {right arrow over (ψ)} in the state space to return the observable, which is the total energy in the system. In this case, the state space is spanned by the set of eigenvectors {right arrow over (φ)} of the operator H. The eigenvectors {right arrow over (φ)} of H satisfy the equation:

[0214] $H\,\vec{\varphi}_{i} = E_{i}\,\vec{\varphi}_{i}$

[0215] E_(i) is the energy of the eigenstate {right arrow over (φ)}_(i). In general the state vector {right arrow over (ψ)} may not be an eigenvector. In this case {right arrow over (ψ)} can be expressed as its spectral decomposition onto the eigenvector basis: $\vec{\psi} = \sum_{i} c_{i}\,\vec{\varphi}_{i}$

[0216] Then the operation of H can be expressed as follows: $\begin{aligned} H\vec{\psi} &= H \sum_{i} c_{i}\,\vec{\varphi}_{i} \\ &= \sum_{i} c_{i} E_{i}\,\vec{\varphi}_{i} \end{aligned}$

[0217] In this equation, E_(i) is the eigenvalue corresponding to the eigenvector {right arrow over (φ)}_(i). Notice that in the above equation, the quantity H{right arrow over (ψ)} can no longer be equated with a term E{right arrow over (ψ)} since {right arrow over (ψ)} is in general not an eigenvector. In this case, although there is no exact value of the energy E, we can extract an expectation for the energy.

[0218] In quantum mechanics, the outcome of an experiment cannot be known with certainty. All that can be known is the probability of measuring an energy E_(i) when the operator H acts on the state {right arrow over (ψ)}. This probability is defined as follows: $\begin{aligned} p(E_{i}) &= \left|\langle \vec{\varphi}_{i} \cdot \vec{\psi} \rangle\right|^{2} \\ &= \Big|\Big\langle \vec{\varphi}_{i} \cdot \sum_{j} c_{j}\,\vec{\varphi}_{j} \Big\rangle\Big|^{2} \\ &= \Big|\sum_{j} c_{j} \langle \vec{\varphi}_{i} \cdot \vec{\varphi}_{j} \rangle\Big|^{2} \\ &= \Big|\sum_{j} c_{j}\,\delta_{ij}\Big|^{2} \\ &= |c_{i}|^{2} \end{aligned}$

[0219] After a large number of measurements H are performed on a system in a particular state {right arrow over (ψ)}, the probability of measuring E_(i) would be: $p(E_{i}) = \frac{\text{number of measurements of } E_{i}}{\text{total number of measurements}}$

[0220] that is, $\frac{N(E_{i})}{N} \xrightarrow{N \rightarrow \infty} p(E_{i})$

[0221] Therefore, the expectation of the observable quantity E can be calculated as follows: $\begin{aligned} \langle E \rangle &= \vec{\psi}\,H\,\vec{\psi} \\ &= \sum_{i} c_{i}\,\vec{\varphi}_{i} \cdot \sum_{j} c_{j} E_{j}\,\vec{\varphi}_{j} \\ &= \sum_{i,j} c_{i} c_{j} E_{j}\,\delta_{ij} \\ &= \sum_{i} c_{i}^{2} E_{i} \\ &= \sum_{i} E_{i}\,p(E_{i}) \end{aligned}$

[0222] Here, the observable represents network abnormality as perceived by the node. In the fault model, network abnormality is defined as correlated abrupt changes in the MIB variables. Thus an operator matrix A is designed to measure the degree of correlation in the input abnormality vectors. The state space is composed of abnormality vectors formed from the variable-level abnormality indicators. Each eigenvalue measures the magnitude of abnormality associated with its eigenvector. Thus, based on the magnitude of the eigenvalues, the corresponding eigenvectors are classified as fault or non-fault vectors.

[0223] Design of the Operator Matrix

[0224] First, a (1×M) input vector {right arrow over (ψ)}(t) is constructed with components:

{right arrow over (ψ)}(t)=[ψ₁(t) . . . ψ_(M)(t)]

[0225] Each component of this vector corresponds to the probability of abnormality associated with one of the MIB variables, as obtained from the sensors. In order to complete the basis set so that all possible states of the system are included, an additional component ψ₀(t) that corresponds to the probability of normal functioning of the network is created. This final component allows for proper normalization of the input vector. The new input vector, {right arrow over (ψ)}(t),

{right arrow over (ψ)}(t)=α[ψ₁(t) . . . ψ_(M)(t) ψ₀(t)]

[0226] is normalized with α as the normalization constant. By normalizing the input vectors, the expectation of the observable of the operator can be constrained to lie between 0 and 1.
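One way to realize this normalization is sketched below. It rests on two assumptions that the text does not state explicitly: that α is fixed at 1/√M (the value that gives unit length to the vector whose abnormality components are all 1 and whose normal component is 0), and that ψ₀(t) is then chosen so that the full vector has unit Euclidean length. The function name `build_input_vector` is hypothetical.

```python
import math


def build_input_vector(abnormality):
    """Append the normal component psi_0 and scale the vector to unit length.

    abnormality: sensor outputs psi_1..psi_M, each a probability in [0, 1].
    Assumption: alpha = 1/sqrt(M), and psi_0 fills whatever length remains.
    """
    m = len(abnormality)
    if m == 0:
        raise ValueError("at least one sensor output is required")
    if any(not 0.0 <= p <= 1.0 for p in abnormality):
        raise ValueError("sensor outputs must lie in [0, 1]")
    alpha = 1.0 / math.sqrt(m)
    # Each p*p <= 1, so the argument of sqrt is never negative.
    psi_0 = math.sqrt(m - sum(p * p for p in abnormality))
    return [alpha * p for p in abnormality] + [alpha * psi_0]


fully_faulty = build_input_vector([1.0, 1.0, 1.0])
single_change = build_input_vector([1.0, 0.0, 0.0])
```

Under these assumptions the normal component vanishes only when every sensor reports maximum abnormality, and it grows as the sensors report less abnormality.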

[0227] Consider the case where M sensor outputs are fed into the fusion center. The appropriate operator matrix A will be (M+1)×(M+1). We design the operator matrix to be Hermitian in order to have an eigenvector basis. Taking the normal state to be uncoupled from the abnormal states, we get a block diagonal matrix with an M×M upper block A_(upper) and a 1×1 lower block: $A = \begin{bmatrix}a_{11} & a_{12} & \cdots & a_{1M} & 0 \\ a_{21} & a_{22} & \cdots & a_{2M} & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{M1} & a_{M2} & \cdots & a_{MM} & 0 \\ 0 & 0 & \cdots & 0 & a_{(M+1)(M+1)}\end{bmatrix}$

[0228] The a_((M+1)(M+1)) element indicates the contribution of the healthy state to the indicator of abnormality for the network node. Since the healthy state should not contribute to the abnormality indicator, we assign a_((M+1)(M+1))=0. Therefore, for the purpose of detecting faults, only the upper block A_(upper) of the matrix is considered.

[0229] The elements of the upper block of the operator matrix A_(upper) are obtained as follows. When i≠j, $\begin{matrix}{A_{upper}\left( i,j \right) = \langle \psi_{i}(t),\psi_{j}(t) \rangle} \\ {= \frac{1}{T}\sum\limits_{t = 1}^{T} \psi_{i}(t)\,\psi_{j}(t)}\end{matrix}$

[0230] which is the ensemble average of the two-point spatial cross-correlation of the abnormality vectors, estimated over a time interval T. For i=j we have $A_{upper}\left( i,i \right) = 1 - \sum\limits_{j \neq i} A_{upper}\left( i,j \right)$

[0231] Using this transformation ensures that the maximum eigenvalue of the matrix A_(upper) is 1. The entries of the matrix describe how the operator causes the components of the input abnormality vector to mix with each other. The matrix A_(upper) is real and symmetric, its elements are non-negative, and hence the solution to the characteristic equation:

A _(upper){right arrow over (Φ)}=λ{right arrow over (Φ)}

[0232] consists of orthogonal eigenvectors {{right arrow over (φ)}_(i)}^(M) _(i=1) with eigenvalues {λ_(i)}^(M) _(i=1). The eigenvectors obtained are normalized to form an orthonormal basis set, and we can decompose any given input abnormality vector as: $\vec{\psi}^{\prime}(t) = \sum\limits_{i = 1}^{M} c_{i}\vec{\varphi}_{i}$

[0233] where {right arrow over (ψ)}′(t) is the transpose of the vector {right arrow over (ψ)}(t). Incorporating the spatial dependencies through the operator transforms the abnormality vector {right arrow over (ψ)}(t) as: $A_{upper}\vec{\psi}^{\prime}(t) = \sum\limits_{i = 1}^{M} c_{i}\lambda_{i}\vec{\varphi}_{i}$

[0234] Here c_(i) measures the degree to which a given abnormality vector falls along the ith eigenvector. This value c_(i) can be interpreted as a probability amplitude and c_(i)² as the probability of being in the ith eigenstate.
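The construction of A_(upper) from the sensor outputs (time-averaged cross-correlations off the diagonal, with the diagonal chosen so that each row sums to 1) can be sketched as follows. The function name `build_a_upper` and the sample indicator series are hypothetical; only the two formulas are taken from the text.

```python
def build_a_upper(series):
    """Build the upper block of the operator matrix.

    series[i][t] is the abnormality indicator of MIB variable i at time t.
    Off-diagonal: (1/T) * sum_t psi_i(t) * psi_j(t).
    Diagonal:     1 - sum of the other entries in the same row.
    """
    m = len(series)
    t_len = len(series[0])
    if any(len(row) != t_len for row in series):
        raise ValueError("all indicator series must have the same length")
    a = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                a[i][j] = sum(x * y for x, y in zip(series[i], series[j])) / t_len
    for i in range(m):
        a[i][i] = 1.0 - sum(a[i][j] for j in range(m) if j != i)
    return a


# Hypothetical indicator traces for three variables over four time steps.
indicators = [
    [0.0, 0.2, 0.9, 0.1],
    [0.1, 0.1, 0.8, 0.0],
    [0.0, 0.3, 0.7, 0.2],
]
A_upper = build_a_upper(indicators)
# Because every row sums to 1, the all-ones vector is mapped to itself.
ones_image = [sum(row) for row in A_upper]
```

Since every row sums to 1, the vector [1 1 1] is an eigenvector with eigenvalue 1, which is why the fully correlated direction is the one the operator emphasizes most.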

[0235] A subset of R of the eigenvectors {{right arrow over (φ)}_(i)}^(M) _(i=1), where R≦M, is called the fault vector set and can be used to define a faulty region. The fault vectors are chosen based on the magnitude of the components of the eigenvector. The eigenvector that has the components [1 1 1] is identified as the most faulty vector, since it corresponds to maximum abnormality in all its components as defined in our fault model. In the fault model, high abnormality means abrupt changes as measured by the individual MIB sensors, and the [1 1 1] vector signifies the correlation of these variable-level changes.

[0236] If a given input abnormality vector can be completely expressed as a linear combination of the fault vectors, $\vec{\psi}^{\prime}(t) = \sum\limits_{r = 1}^{R} c_{r}\vec{\varphi}_{r}$

[0237] then we say that the abnormality vector falls in the fault domain. The extent to which any given abnormality vector lies in the fault domain can be obtained in the following manner. Since any general abnormality vector {right arrow over (ψ)}(t) is normalized, the following condition holds: $\sum\limits_{i = 1}^{M} c_{i}^{2} = 1$

[0238] As there are M different values for c_(i), an average scalar measure of the transformation in the input abnormality vector is obtained by using the quadratic functional,

ƒ({right arrow over (ψ)}(t))={right arrow over (ψ)}(t)A{right arrow over (ψ)}′(t).

[0239] The properties of this functional are described in the following section. Using the above equation and the Kronecker delta, we have: $\vec{\psi}(t)\, A\, \vec{\psi}^{\prime}(t) = \sum\limits_{i = 1}^{M} c_{i}^{2}\lambda_{i} = E(\lambda)$

[0240] The measure E(λ) is the indicator of the average abnormality in the network as perceived by the node. Now consider an input abnormality vector in the fault domain. For such a vector, we obtain a bound for E(λ) as: $\min\limits_{r \in R}\left( \lambda_{r} \right) \leq E(\lambda) \leq \max\limits_{r \in R}\left( \lambda_{r} \right)$

[0241] where λ_(r) are the eigenvalues corresponding to the set of R fault vectors. Thus, using these bounds on the functional ƒ({right arrow over (ψ)}(t)), an alarm is declared when $E(\lambda) > \min\limits_{r \in R}\left( \lambda_{r} \right)$

[0242] The maximum eigenvalue of A_(upper) is 1, and it is by design associated with the most faulty eigenvector. In the following discussion, min_(r∈R)(λ_(r))=λ_(ƒmin) and max_(r∈R)(λ_(r))=λ_(ƒmax).

[0243] Properties of the Quadratic Functional

[0244] Consider the case of M=3. We have the operator matrix A and the input abnormality vector as shown: $A = \begin{bmatrix}a_{11} & a_{12} & a_{13} & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ 0 & 0 & 0 & a_{44}\end{bmatrix}$ $\vec{\psi}(t) = \alpha\begin{bmatrix}\psi_{1}(t) & \psi_{2}(t) & \psi_{3}(t) & \psi_{0}(t)\end{bmatrix}$

[0245] Here |a_(ij)|≦1 for all i and j, and α is the normalization constant. As discussed in the previous section, since there is no interaction between the abnormal and normal states, only the upper block of the operator matrix is considered. Hence: $A_{upper} = \begin{bmatrix}1 - a_{12} - a_{13} & a_{12} & a_{13} \\ a_{21} & 1 - a_{21} - a_{23} & a_{23} \\ a_{31} & a_{32} & 1 - a_{31} - a_{32}\end{bmatrix}$

[0246] A few examples will be presented to demonstrate the properties of the functional ƒ({right arrow over (ψ)}(t)). In the event of a fault (extreme case), according to the present fault model, correlated changes occur in the abnormality indicators. These changes would result in a fault vector of the following form:

{right arrow over (ψ)}(t)=α[1 1 1 0]

[0247] Then we have,${A_{upper}{{\overset{\rightarrow}{\psi}}^{\prime}(t)}} = {\alpha \begin{bmatrix}1 \\1 \\1\end{bmatrix}}$

[0248] The quadratic functional ƒ({right arrow over (ψ)}(t))={right arrow over (ψ)}(t)A{right arrow over (ψ)}′(t) becomes,

{right arrow over (ψ)}(t)A{right arrow over (ψ)}′(t)=3α².

[0249] By normalization, α=1/√3, therefore ƒ({right arrow over (ψ)}(t))=1. Note that in this case, the magnitude of the fault vector and the value of the functional are the same.

[0250] Now consider the case in which a random uncorrelated change occurs in only one of the abnormality indicators. In this case the input abnormality vector would be,

{right arrow over (ψ)}(t)=(1/√3)[1 0 0 √2]

[0251] The fourth component of this vector is the normal component, which is required to normalize the input abnormality vector. Now we have $A_{upper}\vec{\psi}^{\prime}(t) = \frac{1}{\sqrt{3}}\begin{bmatrix}a_{11} \\ a_{21} \\ a_{31}\end{bmatrix}$ $f\left( \vec{\psi}(t) \right) = \frac{a_{11}}{3} < \frac{1}{3} < \left| \vec{\psi}(t) \cdot \vec{\psi}(t) \right|$

[0252] Note that a₁₁=1−a₁₂−a₁₃. Hence, in the event of an uncorrelated random change, the value of the functional is much smaller than the magnitude of the input vector.

[0253] Therefore, using the functional ƒ({right arrow over (ψ)}(t)), we obtain a scalar quantity with the following properties:

[0254] (1) The value of the functional ranges from 0 to 1.

[0255] (2) In the event of correlated changes the value of the functional goes to 1.

[0256] (3) In the event of random uncorrelated changes the functional has a value much smaller than 1.

[0257] Thus the quadratic functional has the required properties to identify faults as described by our model, by enhancing the correlated changes and deemphasizing the uncorrelated changes associated with the normal functions of the network.
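The two examples in this section can be reproduced with a few lines of code. This is an illustrative sketch for M=3; the coupling values 0.1, 0.2 and 0.3 are hypothetical placeholders rather than values from the specification, and the helper names are assumptions.

```python
import math


def a_upper_3(a12, a13, a23):
    """Symmetric 3x3 upper block whose rows sum to 1."""
    return [
        [1 - a12 - a13, a12, a13],
        [a12, 1 - a12 - a23, a23],
        [a13, a23, 1 - a13 - a23],
    ]


def functional(psi_upper, a_upper):
    """Quadratic functional f = psi * A_upper * psi' on the abnormal components."""
    a_psi = [sum(a * p for a, p in zip(row, psi_upper)) for row in a_upper]
    return sum(p * q for p, q in zip(psi_upper, a_psi))


alpha = 1.0 / math.sqrt(3.0)
A = a_upper_3(0.1, 0.2, 0.3)  # hypothetical couplings

# Correlated change: all three indicators fully abnormal.
f_correlated = functional([alpha, alpha, alpha], A)
# Uncorrelated change: only the first indicator is abnormal.
f_single = functional([alpha, 0.0, 0.0], A)
```

With these placeholder couplings the correlated case evaluates to 1, while the single-indicator case evaluates to a₁₁/3, which is about 0.23 here.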

[0258] Operator for the Network Level Agent: A_(ip)

[0259] In order to design an operator for the network level agent, we assume that the correlation under normal situations indicates the correlation at fault times as well. Therefore we can use the correlation matrix to design the operator. At the router, three variables (viz., ipIR, ipIDe, and ipOR) are considered. Including the normal probability, a 1×4 input vector was required:

{right arrow over (ψ)}_(ip)(t)=α_(R)[ψ_(IR)(t)ψ_(IDe)(t)ψ_(OR)(t)ψ_(ip)_(normal) (t)].

[0260] The input vector corresponding to a completely faulty state is{right arrow over (ψ)}=α_(R)[1 1 1 0]

[0261] The fourth component is 0, since the system is completely faulty. Using this vector, the normalization constant α_(R) for the router was calculated to be 1/√3.

[0262] The appropriate operator matrix A_(ip) will be 4×4. Taking the normal state to be uncoupled from the abnormal states, we get a block diagonal matrix with a 3×3 upper block A_(ipupper) and a 1×1 lower block: $A_{ip} = \begin{bmatrix}a_{11} & a_{12} & a_{13} & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ 0 & 0 & 0 & a_{44}\end{bmatrix}$

[0263] The a₄₄ element indicates the contribution of the healthy state to the indicator of abnormality for the network node (E[λ]). Since the healthy state should not contribute to the abnormality indicator, we assigned a₄₄=0. The elements a_(mn) of A_(ipupper) are estimated based on the spatial correlation between the abnormality indicators. The couplings of the ipIR variable with the ipOR and ipIDe variables (a₁₂ and a₁₃) are estimated as 0.08 and 0.05, respectively. This weak correlation can be explained by the fact that the majority of packets received by the router are forwarded at the ip layer and not sent to the higher layers. The coupling between ipIDe and ipOR (a₂₃) is significantly higher, since both variables relate to router processing which is performed at the higher layers. By symmetry, a₂₁=a₁₂, a₃₁=a₁₃ and a₂₃=a₃₂. The main diagonal terms are assigned such that the rows and columns sum to 1. Thus, the A_(ipupper) matrix becomes: $A_{{ip}_{upper}} = \begin{bmatrix}0.87 & 0.08 & 0.05 \\ 0.08 & 0.60 & 0.32 \\ 0.05 & 0.32 & 0.63\end{bmatrix}$

[0264] The elements of the matrix are calculated according to the above equations, using an 8 hour data trace from the campus network. (The values obtained for the enterprise network data were the same as those for the campus network.) Note that the lower block does not affect the indicator of network abnormality. Hence the computation only uses the upper block. Therefore, the above equation becomes:

E[λ]={right arrow over (ψ)} _(upper)(t)A_(ip) _(upper) {right arrow over(ψ)}_(upper)(t)

[0265] The eigenvalues of the upper block matrix A_(ipupper) are λ₁=0.2937, λ₂=0.8063, and λ₃=1. The corresponding eigenvectors are {right arrow over (φ)}₁=[−0.0414 0.7269 −0.6855], {right arrow over (φ)}₂=[0.8154 −0.3718 −0.4436], and {right arrow over (φ)}₃=[0.5774 0.5774 0.5774]. The fourth eigenvector, which is not shown, is {right arrow over (φ)}₄=[0 0 0 1] with eigenvalue λ₄=0. The portion of the sphere shown in the first sector of the three dimensional space in FIG. 51 represents the problem domain. This is because the input variables to the fusion center range from 0 to 1. The eigenvector {right arrow over (φ)}₃ corresponds to the total fault vector (all components abnormal) and lies at the center of the problem domain. Eigenvectors {right arrow over (φ)}₁ and {right arrow over (φ)}₂ are necessarily outside the problem domain, since they must be orthogonal to {right arrow over (φ)}₃. Thus in the present problem, unlike in quantum mechanics, two of the eigenvectors are outside the problem domain; however, projections of the input abnormality vector onto {right arrow over (φ)}₁ and {right arrow over (φ)}₂ are allowed. The eigenvectors {right arrow over (φ)}₂ and {right arrow over (φ)}₃ are used to define the faulty region of the space. The vector {right arrow over (φ)}₂ is chosen since it has the highest value in the first component. This component represents the ipIR abnormality indicator. Since the system studied is a router, the ipIR variable samples the majority of the traffic passing through the router.

[0266] A fault is declared when E[λ] falls between λ₂=0.8063 and λ₃=1. Note that input vectors which are not composed exclusively of {right arrow over (φ)}₂ and/or {right arrow over (φ)}₃ could still yield an E[λ]>λ₂, but these vectors would necessarily have large projections on {right arrow over (φ)}₂ and/or {right arrow over (φ)}₃. The abnormal region is defined as: λ₂<E[λ]={right arrow over (ψ)}_(upper)(t)A_(ip) _(upper){right arrow over (ψ)}_(upper)(t)≦λ₃
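The router-level decision rule can be exercised with the matrix and threshold given above. The sketch below is illustrative only: the function names and the three sample indicator vectors are hypothetical, and the indicators are scaled by α_(R)=1/√3 before the quadratic form is evaluated.

```python
import math

# Upper block of the router-level operator, as given in paragraph [0263].
A_IP_UPPER = [
    [0.87, 0.08, 0.05],
    [0.08, 0.60, 0.32],
    [0.05, 0.32, 0.63],
]
LAMBDA_2 = 0.8063  # smallest eigenvalue in the fault vector set
ALPHA_R = 1.0 / math.sqrt(3.0)  # router normalization constant


def average_abnormality(indicators):
    """E[lambda] for raw indicators (psi_IR, psi_IDe, psi_OR), each in [0, 1]."""
    psi = [ALPHA_R * p for p in indicators]
    a_psi = [sum(a * p for a, p in zip(row, psi)) for row in A_IP_UPPER]
    return sum(p * q for p, q in zip(psi, a_psi))


def is_fault(indicators):
    """Declare a fault when E[lambda] exceeds the lower fault eigenvalue."""
    return average_abnormality(indicators) > LAMBDA_2


all_abnormal = average_abnormality([1.0, 1.0, 1.0])       # fully correlated
single_change = average_abnormality([1.0, 0.0, 0.0])      # only ipIR abnormal
mostly_correlated = average_abnormality([1.0, 0.9, 0.9])  # hypothetical input
```

In this sketch the fully correlated input reaches the maximum value of 1, the single ipIR change yields 0.29 and stays well below the threshold, and the mostly correlated input lands at roughly 0.87, inside the abnormal region.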

[0267] FIG. 52 shows the range of the average abnormality in the system by the variation in color. When all the components of the input abnormality vector {right arrow over (ψ)}(t) (viz., ψ_(IR)(t), ψ_(IDe)(t) and ψ_(OR)(t)) are 1 (i.e., for maximum correlation of abnormality indicators), the average abnormality corresponds to the maximum eigenvalue 1. This maximum value is depicted by the dark red color. Note that as the abnormality indicators decrease in correlation and/or magnitude, the red hue decreases.

[0268] Operator for the Interface Level Agent: A_(if)

[0269] At the interface we consider two variables (viz) ifIO, and ifOO.Therefore, including the normal state, the input vector is 1×3.

{right arrow over (ψ)}_(if)(t)=α_(I)[ψ_(IO)(t) ψ_(OO)(t) ψ_(if) _(normal)(t)]

[0270] The input vector that corresponds to the maximum abnormality is {right arrow over (ψ)}_(if)(t)=α_(I)[1 1 0]. Therefore the normalization constant α_(I) for the interface agent is 1/√2. The operator matrix A_(if) is designed as explained in the case of the router, but now we have a 3×3 matrix: $A_{if} = \begin{bmatrix}0.99 & 0.01 & 0 \\ 0.01 & 0.99 & 0 \\ 0 & 0 & 0\end{bmatrix}$

[0271] The elements of the operator matrix have been estimated in a manner analogous to the method used for A_(ip). However, the two variables considered here are not highly coupled, since they correspond to the number of octets that come into and go out of a particular interface. The eigenvalues of the upper block matrix A_(ifupper) are λ₁=0.98 and λ₂=1. The corresponding eigenvectors of the upper block are {right arrow over (φ)}₁=[0.7071 −0.7071] and {right arrow over (φ)}₂=[0.7071 0.7071]. The third eigenvector is {right arrow over (φ)}₃=[0 0 1] with eigenvalue λ₃=0. The sector shown in the first quadrant of the two dimensional space in FIG. 53 is the problem domain, and the fault vectors are {right arrow over (φ)}₁ and {right arrow over (φ)}₂. The corresponding abnormality domain equation is:

λ₁<E[λ]≦λ₂ (abnormal region)

[0272] In FIG. 54, the average abnormality values for the entire problem domain for the if layer are shown. When both the input components of the abnormality vector are 1, we have a maximum for the average abnormality indicator.

[0273] Combining Severity and Persistence of Alarms

[0274] It is observed that prior to fault situations the average abnormality indicator, or the correlated abrupt changes, exhibited persistent abnormal behavior. In contrast, when there is no fault, persistence is lacking. Persistence is defined as follows: given an instance of high average abnormality (an alarm condition), a second instance of an alarm occurs within a specified interval of (τ−1) lags. This persistence behavior can be exploited to declare alarms corresponding to network fault situations. By incorporating persistence, we are able to significantly reduce the number of false alarms. As seen from FIG. 55, there exists a persistence in the alarms just prior to the fault situation denoted by the asterisks. However, in FIG. 56 the alarms obtained are not persistent, and there was no fault situation recorded at this time. Note that the router health does show some potential alarms due to the correlated changes in the traffic patterns across the different MIB variables. However, the correlated changes in traffic patterns do not persist for more than a single instant. Thus, by incorporating persistence, a large number of false alarms can be filtered.
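A persistence filter of this kind might look like the sketch below. It assumes a causal implementation in which an alarm is confirmed at the time of the second instance, by looking back over the preceding (τ−1) samples; the function name and the sample alarm sequence are hypothetical.

```python
def persistent_alarms(raw_alarms, tau):
    """Return the indices of alarms confirmed by persistence.

    raw_alarms: sequence of 0/1 flags, one per polling instant, set when the
                average abnormality indicator exceeds its threshold.
    tau:        persistence parameter; an alarm at index t is confirmed only
                if another raw alarm occurred within the previous tau-1 lags.
    """
    if tau < 1:
        raise ValueError("tau must be at least 1")
    confirmed = []
    for t, flag in enumerate(raw_alarms):
        if not flag:
            continue
        window = raw_alarms[max(0, t - (tau - 1)):t]
        if any(window):
            confirmed.append(t)
    return confirmed


raw = [0, 1, 0, 0, 1, 1, 0, 1]  # hypothetical raw alarm flags
confirmed_tau2 = persistent_alarms(raw, tau=2)
confirmed_tau3 = persistent_alarms(raw, tau=3)
```

The isolated alarm at index 1 is discarded for both settings, which is the false-alarm filtering effect described above; a larger τ tolerates wider gaps between the two instances.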

[0275] Experimental Results

[0276] Initially, the issues involved in the data collection process are discussed. Analytical and experimental results on the impact of the data collection processes on the performance of the network are provided. Four case studies of faults detected by the agent on two different networks are provided: one from a campus LAN network and three from an enterprise network.

[0277] Data Collection

[0278] Preliminary studies on the data collection mechanism have been done at Rensselaer Polytechnic Institute (RPI). The impact of the data collection mechanism on two important aspects of the network, CPU utilization and network load, was evaluated. This is a crucial step to ensure that the monitoring of the network is done in an unobtrusive manner. The experimental results are compared with analytic results. It is shown that the analytic results provide an upper bound and can be safely used to conservatively estimate the impact of the data collection on the CPU in any generic environment. The experimental setup and the details of the results are presented.

[0279] Experimental Setup

[0280] The data collection was performed on a local network 200 (shown in FIG. 57) at the Networks Lab at RPI. The SNMP daemon was installed on the internal router (Poisson in FIG. 57) in the lab. Poisson 17 is a Sun Ultra SPARC station running Solaris. The data collection mechanism consists of software which runs on another machine 19 (Erlang in FIG. 57) and queries the MIB database at regular intervals of τ seconds. The query is done using the “snmpget” function that is provided along with the SNMP manager software. The experiment was run for polling intervals of τ=1, 10, 15, 30, and 60 s. Each experiment was run for durations of 2400 s (50 min) and 7200 s (2 hrs) for each polling interval τ.

[0281] CPU Utilization

[0282] One of the most important concerns in querying a database at a router is the impact on the router's CPU. For a generic machine, the CPU utilization can be computed using the following equation:

CPU utilization=n*d/T

[0283] where n=number of agents polled, d=max{d_(i)} where d_(i)=time required to process the required request/response for the ith agent, and T=polling interval in seconds. The analytical results were evaluated using n=1, since only one agent is polled. The results are tabulated in FIG. 58. Note: the value of d was experimentally determined to be 0.1125 s. This was the maximum time taken by the CPU to process one query on the single agent at which the data was collected. Using the maximum value of d provides a conservative bound on the CPU utilization.
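For illustration, the analytic bound can be evaluated for the polling intervals used in the experiments. This is a small sketch: the function name `cpu_utilization` is hypothetical, and the values n=1 and d=0.1125 s are the ones stated above. The figures it produces are computed from the formula and are not copied from FIG. 58.

```python
def cpu_utilization(n_agents, d_seconds, polling_interval):
    """Analytic CPU utilization bound n*d/T, returned as a fraction of CPU time."""
    if polling_interval <= 0:
        raise ValueError("polling interval must be positive")
    return n_agents * d_seconds / polling_interval


D_MAX = 0.1125  # maximum per-query processing time in seconds, from the text
POLLING_INTERVALS = (1, 10, 15, 30, 60)  # seconds

bounds = {T: cpu_utilization(1, D_MAX, T) for T in POLLING_INTERVALS}
for T, u in bounds.items():
    print(f"T = {T:2d} s -> bound = {100 * u:.4f} % of CPU")
```

At a 10 second polling interval the bound works out to about 1.1% of CPU time, and it falls in inverse proportion to the interval.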

[0284] The experimental results are tabulated in FIG. 59. The CPU utilization was obtained using the “ps” command on UNIX. The average CPU utilization per second and the average CPU utilization per request are also tabulated. The CPU utilization for the different polling intervals is shown in FIG. 60. It is observed that page faults played a role in the performance. Although the average CPU utilization/s tends to go down as the polling interval gets longer, the average CPU utilization/request goes up, since the longer the interval, the longer the setup time needed to get the daemon back into memory. Since 10 and 15 seconds are rather close to one another, we see very close results, and they are near the gap between frequently paging and mostly paging. This is also due to the fact that only one second resolution is present. It is assumed that almost never paging generates an average CPU utilization of 0.154 s and always paging generates an average CPU utilization of 0.0750 s. It is seen that at a 10 second interval paging is performed about 43% of the time and at a 15 second interval paging is performed about 86% of the time. Thus, in all the cases, the analytic values upper bound the experimental results.

[0285] Network Load

[0286] The network utilization can be computed using the following equation:

Network load=(RQ+RS)*8/T

[0287] where RQ=size of a request in bytes, RS=size of a response in bytes, and T=polling interval in seconds. The values used in the computation of network load are RQ=849 bytes and RS=946 bytes. The values of RQ and RS were experimentally obtained using the application “tcpdump -e”. Here all the request messages were 849 bytes and all response messages were 946 bytes. Unlike the bounding results obtained in the case of CPU utilization, the results for network load are exact.
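As a worked instance of the formula above, take T=10 s, one of the polling intervals used in the experiments. The unit of bits per second is an assumption here, inferred from the factor of 8 applied to byte counts and the division by seconds:

```latex
\text{Network load}
  = \frac{(RQ + RS) \times 8}{T}
  = \frac{(849 + 946) \times 8}{10}
  = \frac{14360}{10}
  = 1436\ \text{bits/s}
```

By the same arithmetic, the load at T=60 s is 14360/60, or about 239 bits/s.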

[0288] Summary on Data Collection

[0289] From the experiments conducted and the analysis performed, the following conclusions are made:

[0290] 1. The analytical results provide an upper bound on the CPU utilization.

[0291] 2. The load on the network is minimal at polling intervals of 10 or more seconds.

[0292] 3. The average CPU utilization is approximately 1% or less.

[0293] All of the above observations provide sound justification that the data collection mechanism will not seriously impact network performance.

[0294] Field Testing of the Agent

[0295] The intelligent agent has been tested on two different production networks: (1) a campus network and (2) an enterprise network. The two networks differ significantly in terms of their traffic patterns and also in their topology and size. In this section the characteristics of each of these networks are described.

[0296] Campus LAN Network

[0297] The experiments were conducted on the Local Area Network (LAN) of the Computer Science (CS) Department at Rensselaer Polytechnic Institute. The network topology is as shown in FIG. 62. The CS network forms one subnet of the main campus network. The network implements the IEEE 802.3 standard. Within the CS network there are seven smaller subnets 7 a-7 g and two routers 1 a, 1 b. All of the subnets 7 a-7 g use some form of CSMA (Carrier Sense Multiple Access) for transmission. The routers 1 a, 1 b implement a version of Dijkstra's algorithm. One router (shown as router 1 b in FIG. 62) is used for internal routing and the other serves mainly as a gateway (shown as router 1 a) to the campus backbone. The external router or gateway also provides some limited amount of internal routing. Syslog messages were used to identify network problems. One of the most common network problems was an NFS server not responding. Possible reasons for this problem are unavailability of the network path or that the server was down. The syslog messages only reported that the file server was not responding after the server had crashed. Although not all problems could be associated with syslog messages, those problems which were identified by syslog messages were accurately correlated with fault incidents.

[0298] Enterprise Network

[0299] The topology of the enterprise network 300 is as shown in FIG. 63. This network 300 was significantly larger than the campus network. Each individual subnet was connected by the internal router 16, which also hosts an SNMP agent. Data was collected from the interface of subnet 26 and subnet 21 with the internal router and at the router itself. The existing network management scheme consisted of a trouble ticketing system which contained problem descriptions as reported by the end users. Syslog messages were also reported.

[0300] Implementation Specifications

[0301] The parameters of the algorithm that are obtained for this design are:

[0302] p: the order of the AR process

[0303] N_(L) and N_(τ): learning and test window sizes

[0304] A_(ip) and A_(if): operator matrices for the ip and if level agents.

[0305] τ: the persistence time.

[0306] The parameter obtained through online learning is:

[0307] α₁: the AR parameter.

[0308] Case Studies of Typical Faults

[0309] In this section, one specific fault from each of the different types of faults observed in the two networks is described.

[0310] Case Study (1): File Server Failures

[0311] In this case study a fault scenario corresponding to a file server failure on subnet 2 of the campus network is described. This case represents a predictable network problem where the traffic related MIB variables show signs of abnormality before the occurrence of the failure. 12 machines on subnet 2 and 24 machines outside subnet 2 reported the problem via syslog messages. The duration of the fault was from 11:10 am to 11:17 am (7 mins) on Dec. 5, 1995, as determined by the syslog messages. The cause of the fault was confirmed to be an excessive number of ftp requests to the specific file server. FIGS. 64 through 67 show the output of the intelligent agent at the router and at the ip layer variable level. Note that there is a drop in the mean level of the traffic in the ipIR variable prior to the fault. The indicators provide the trends in abnormality. The fault period is shown by the vertical dotted lines. In FIG. 64 for router health, the ‘x’ denotes the alarms that correspond to input vectors that are faulty. Note that there are very few such alarms at the router level. The fault was predicted 21 mins before the crash occurred. The mean time between false alarms in this case was found to be 1032 mins (approx 17 hrs). The persistence in the abnormal behavior of the router is also captured by the indicator. The on-off nature of the ipIDe and ipOR indicators was attributed to the less bursty behavior of those variables. The alarms generated at the interface level, along with the variable-level abnormality indicators, are shown in FIGS. 68 through 70. In both the if level variables we observe a significant drop in the mean traffic prior to the fault. The fault was predicted 27 mins before the file server crashed and the mean time between false alarms was 100 mins (approx 1.5 hrs). The bursty behavior of both the if variables results in an excessive number of false alarms generated at the output of the if agent. The fault was first predicted at the interface level, about 6 mins prior to the router level. The alarms obtained approximately an hour and a half before the fault could also be associated with the same fault, but there is no way to confirm this. Thus the results obtained at the if agent can be used to confirm the alarms declared at the ip agent. Note also that the subnet shows abnormal behavior soon after the fault. This was attributed to the hysteresis of the fault. In the present scheme, no measures are taken to combat this effect.

[0312] Case Study (2): Protocol Implementation Errors

[0313] This fault case is one where the fault is not predictable but the symptoms of the fault can be observed. One of the faults detected on the enterprise network was a super server inetd protocol error. The super server is the server that listens for incoming requests for various network servers, thus serving as a single daemon that handles all server requests from the clients. The existence of the fault was confirmed by syslog messages and trouble tickets. The syslog messages reported the inetd error. In addition to the inetd error, other faulty daemon process messages were also reported during this time. Presumably these faulty daemon messages are related to the super server protocol error. The trouble tickets also reported problems at the time of the super server protocol error. These problems were the inability to connect to the web server, send mail, or print on the network printer, and also difficulty in logging onto the network. The super server protocol problem is of considerable interest since it affected the overall performance of the network for an extended period of time. The detection scheme performed well on this type of error. FIGS. 71 through 74 show the alarms generated at the router level. The prediction time was 15 mins with respect to the syslog messages of the existing management schemes. The existing trouble ticketing scheme only responds to the fault situation and there is no adaptive learning capability. There were no false alarms reported in this data set. Persistent alarms were observed just before the fault. FIGS. 75 through 77 show the alarms generated at the subnet level (subnet 21). The prediction time was 32 mins. A hysteresis effect was observed soon after the fault. The mean time between false alarms was 116 mins. The alarms at the subnet occur in advance of those observed at the router, suggesting that the problem can possibly be resolved to the subnet level. The fault may be presumed to have originated at the subnet and then propagated through the network. The origin of the fault in this case is the location of the super server, which we may infer, based on the alarm sequences obtained, to have been located on the subnet being monitored. This inference was confirmed to be true by consulting with the system administrator. The propagation through the network is the consequence of more and more clients trying to access applications that depend on the super server to

[0314] Case Study (3): Network Access Problems

[0315] Network access problems are predictable. These problems were reported primarily in the trouble tickets. These faults were often not reported by the syslog messages. Due to the inherent reactive nature of trouble tickets, it is hard to determine the exact time when the problem occurred. The trouble reports received ranged from the network being slow to the inaccessibility of an entire network domain. FIGS. 78 through 81 show the alarms obtained at the router level. The prediction time was 6 mins. The mean time between false alarms was 286 mins. FIGS. 82 through 84 show the alarms obtained at the subnet 26 of the router. In this case the alarms were obtained 12 mins after the fault report was received. The mean time between false alarms was 269 mins.

[0316] Case Study (4): Runaway Processes

[0317] A runaway process is an example of high network utilization by some culprit user that affects network availability to other users on the network. A runaway process is an example of an unpredictable fault whose symptoms can nevertheless be used to detect an impending failure. This is a commonly occurring problem in most computation oriented network environments. Runaway processes are known to be a security risk to the network. This fault was reported by the trouble tickets, but long after the network had run out of process identification numbers. In spite of a large number of syslog messages being generated during this period, there was no clear indicator that a problem had occurred. FIGS. 85 through 88 show the performance of the agent in the detection of the runaway process. The prediction time was 1 min and the mean time between false alarms was 235 mins. FIGS. 89 through 91 show the alarms obtained at subnet 26 of the router. The alarms were obtained at the same time as when the system reported a lack of process identification numbers. The mean time between false alarms was 433 mins.

[0318] Summary of Experiments

[0319] Thus far the agent has been successful in identifying four different types of faults: file server failures, network access problems, runaway processes and a protocol implementation error. The agent detected or predicted 8 of 9 file server failures on the campus network and 15 file server failures on the enterprise network. It also detected or predicted 8 instances of network access problems, 1 protocol implementation error and 1 instance of a runaway process on the enterprise network. In all these cases the effects of the faults were observed in the chosen traffic-related MIB variables. Also, the changes associated with these fault events occurred in a correlated fashion, thus resulting in their detection by the agent.

[0320] Performance of the Intelligent Agent and Composite Results

[0321] The performance of an online detection/prediction scheme is measured in terms of the mean time between false alarms and the mean prediction time. Here, these metrics are described and are tabulated for the intelligent agent. The complexity of the algorithm is provided along with an implementation flow chart. Composite results obtained for the different types of faults predicted or detected on both the campus and the enterprise network are provided. A discussion on the limitations of this approach and the occurrence of false alarms is included.

[0322] Performance Measures for the Agent

[0323] The performance of the algorithm is expressed in terms of the prediction time T_p and the mean time between false alarms T_f. Prediction time is the time to the fault from the nearest alarm preceding it. A true fault prediction is identified by a fault declaration which is correlated with an accurate fault label from an independent source such as syslog messages and/or trouble tickets. Fault prediction therefore covers two situations: (a) in the case of predictable faults such as file server failures and network access problems, true prediction is possible by observing the abnormalities in the MIB data, and (b) in the case of unpredictable faults such as protocol implementation errors, early detection is possible as compared to the existing mechanisms such as syslog messages and trouble reports. Any fault declaration which did not coincide with a label was declared a false alarm. The quantities used in studying the performance of the agent are depicted in FIG. 92. τ is the number of lags used to incorporate the persistence criterion in order to declare alarms corresponding to fault situations. In some cases alarms are obtained only after the fault has occurred. In these instances, we only detect the problem. The detection time T_d is measured as the time elapsed between the occurrence of the fault and the declaration of the alarm. There are some instances where alarms were obtained both preceding and following the fault. The alarms that follow the fault in these cases are attributed to the hysteresis effect of the fault.

[0324] The mean time between false alarms provided an indication of the performance of the algorithm. For a router in the campus network the average number of alarms obtained was 1 alarm per 24 hrs and in the enterprise network there were 4 alarms per 24 hrs. The average prediction time for both the campus and the enterprise network was 26 mins.
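
The prediction time and the mean time between false alarms can be illustrated with a minimal sketch. It assumes that alarm and fault times are given in minutes, and that an alarm counts as predicting a fault only if it falls within a hypothetical `horizon` before a labelled fault. The function name `score_alarms` and the `horizon` parameter are illustrative assumptions and are not part of the described system. Alarms that follow a fault are not treated specially here.

```python
from statistics import mean


def score_alarms(alarm_times, fault_times, horizon):
    """Return (mean prediction time, mean time between false alarms).

    alarm_times, fault_times: sorted times in minutes.
    horizon: assumed window (minutes) before a labelled fault within
             which an alarm is counted as predicting that fault.
    """
    prediction_times = []
    matched = set()
    for fault in fault_times:
        # Alarms that precede this fault within the assumed horizon.
        preceding = [a for a in alarm_times if 0 <= fault - a <= horizon]
        if preceding:
            nearest = max(preceding)  # nearest alarm preceding the fault
            prediction_times.append(fault - nearest)
            matched.update(preceding)

    # Any alarm that did not coincide with a label is a false alarm.
    false_alarms = [a for a in alarm_times if a not in matched]
    gaps = [b - a for a, b in zip(false_alarms, false_alarms[1:])]

    t_p = mean(prediction_times) if prediction_times else None
    t_f = mean(gaps) if gaps else None
    return t_p, t_f
```

For example, with alarms at 0, 300, 480 and 900 minutes, one labelled fault at 500 minutes and a 60 minute horizon, only the alarm at 480 is matched to the fault and the remaining three alarms are counted as false.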

[0325] Composite Results and the Capability of the Agent

[0326] Campus Network Data

[0327] The only type of failure observed in this network was file server failure.

[0328] File Server Failures

[0329] The composite results for the alarms obtained from the internal router in the case of file server failures are compiled in FIG. 93. The average prediction time with a persistence criterion of τ=3 was 26 mins, which is much less than half the mean time between false alarms, 455 mins (approx. 7.5 hrs). The time scale of prediction is large enough to allow time for potential corrective measures. Eight out of nine faults were predicted.

[0330] In data set 3, the fault was reported by only two machines on the same subnet on which the faulty file server was located. This suggests that for this fault there was minimal impact on the ip level traffic. Furthermore, the fault occurred in the early morning hours (1:23 am-1:25 am). All these reasons contributed to the fault not being predicted. However, for this fault case, an alarm approximately 93 mins prior to the fault was observed. This could very well be due to the increase in traffic caused by the daily backup on the system, which occurs around midnight. Therefore, it is concluded that in this case the fault was localized within the subnet and did not affect the router variables. Both faults in subnet 3 were predicted since they affected the router variables. This is corroborated by the fact that machines on both subnet 2 and subnet 4 reported the fault.

[0331] The results for the if agent in the case of file server failures on the campus network are tabulated in FIG. 94. The if agent did not perform as well as the ip agent. This is due to the bursty nature of both the if level variables. The mean prediction time T_p was 72 mins and the mean detection time was 28 mins. The mean time between false alarms was 304 mins (approx. 5 hrs). Only 2 out of the nine faults were predicted. Three others were detected. Fault 2 in data set 3 could not have been predicted or detected since only 2 machines on the same subnet as the faulty server reported the problem. Thus, the fault could not have affected the if or the ip variables. Despite the lack of information from the if variables of subnet 3 (data set 6), the system algorithm was able to detect one of the two faults on the subnet. Therefore, having data from all interfaces will improve prediction.

[0332] The system algorithm was capable of detecting faults that occurred at different times of the day. Regardless of the number of machines that are affected outside the subnet, the agent is able to predict the problem as long as there is sufficient traffic that affects the network layer (ip) and the interface (if) level variables.

[0333] Enterprise Network Data

[0334] On the enterprise network, three different types of faults were encountered: one accept protocol implementation error on a super server, one runaway process and 15 file server failures.

[0335] File Server Failures

[0336] The composite results for the detection of file server failures obtained at the router level on the enterprise network are tabulated in FIG. 95. Note that, unlike on the campus network, the majority of the file server failures were not detected at the router. The inability of the router level traffic to detect simple file server failures is attributed to the presence of switches that contain the traffic within a particular subnet. Only when the failure affects machines outside the subnet under consideration will it be detected by the router level indicators. The detection results obtained at the interface level have been tabulated in FIG. 95. It is observed that almost all the file server failures were predicted at the interface level. The traffic at the interface level provided indicators related to faults local to a given subnet. Thus, having traffic data from multiple interfaces will help to isolate the problem to a subnet level.

[0337] Network Access Problems

[0338] The alarms obtained under this category of network problems are indicative of performance problems. The abnormality indicator obtained in this scenario can also be interpreted as a QoS measure for the network in the absence of drastic network failures. The detection results for network access failures are tabulated in FIG. 97. The detection results at the interface level are shown in FIG. 98. It was found that both the router level and subnet level indicators were capable of detecting network access problems. In some cases, only one of the indicators was capable of indicating the existence of a problem. This example also suggests the need to have both the router and subnet level information for comprehensive management.

[0339] Protocol Implementation Error

[0340] Only one protocol implementation error was observed, and the results obtained for both the router and the subnet are provided in FIG. 99. This type of failure can in general be considered as a software implementation error.

[0341] Runaway Process

[0342] One occurrence of a runaway process was also detected by the agent and the results are tabulated in FIG. 100. The detection obtained at the subnet level coincided with the label of the fault, as can be seen in the Figures of case study 4.

[0343] Flow Chart for the Implementation of the Algorithm

[0344] FIG. 101 provides a flow chart describing the algorithm used by both the if and the ip agent to obtain the average abnormality indicator. The process starts at step S1. Next, at step S2, the MIB data is polled. Then, at step S3, the variable level abnormality indicators are generated. These indicators are next evaluated at step S4. If the alarms thus obtained satisfy the persistence criterion at step S5, then a fault situation is declared at step S6. If not, then the process starts over again at step S2.
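
The flow of steps S2 through S6 can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the implementation of FIG. 101: the callables `variable_indicators` and `combine`, the scalar `threshold`, and the reading of the persistence criterion as "τ consecutive alarms" are all hypothetical choices made for the example.

```python
from collections import deque


def detect_faults(samples, variable_indicators, combine, threshold, tau):
    """Yield the index of each polling interval at which a fault is declared.

    samples: iterable of polled MIB samples (step S2).
    variable_indicators: assumed callable mapping a sample to a list of
        variable-level abnormality indicators (step S3).
    combine: assumed callable reducing those indicators to one value (step S4).
    threshold, tau: assumed alarm threshold and persistence length (step S5).
    """
    recent = deque(maxlen=tau)  # the last tau alarm decisions
    for k, sample in enumerate(samples):          # S2: poll the MIB data
        indicators = variable_indicators(sample)  # S3: variable-level indicators
        alarm = combine(indicators) > threshold   # S4: evaluate the indicators
        recent.append(alarm)
        if len(recent) == tau and all(recent):    # S5: persistence criterion
            yield k                               # S6: declare a fault
        # Otherwise fall through and poll again (back to S2).
```

With `tau=3`, an isolated alarm or a pair of consecutive alarms does not trigger a fault declaration; only a run of three does.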

[0345] Complexity of the Agent Algorithm

[0346] The detection scheme for the agent is based on a linear model, rendering it feasible for online implementation. The complexity of the detection scheme as a function of the number of model parameters is O(M), where M is the number of input MIB variables. The four model parameters for each MIB variable are the mean and variance of the residual signals, the learning window size and the test window size. The order of complexity increases linearly, and thus the method is scalable to a large number of nodes. For a given router with K interfaces, the ip level agent requires 12 model parameters and the if level agent requires 8 parameters per interface. Thus, the total number of model parameters for the router is 8K+12. Therefore, the agent is of sufficiently low order of complexity to enable its implementation on wide area routers.
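
The 8K+12 count follows from four parameters per MIB variable. As a small worked example, assuming the three ip variables and two if variables per interface named elsewhere in this description, the count can be reproduced as follows. The function and constant names are illustrative only.

```python
PARAMS_PER_VARIABLE = 4  # mean, variance, learning window size, test window size
IP_VARIABLES = 3         # assumed: ipIR, ipIDe, ipOR
IF_VARIABLES = 2         # assumed: ifIO, ifOO, per interface


def router_parameter_count(num_interfaces):
    """Total model parameters for a router with the given number of interfaces."""
    ip_params = PARAMS_PER_VARIABLE * IP_VARIABLES                   # 12
    if_params = PARAMS_PER_VARIABLE * IF_VARIABLES * num_interfaces  # 8K
    return ip_params + if_params
```

For a router with K = 4 interfaces this gives 8 × 4 + 12 = 44 parameters, and the count grows linearly in K.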

[0347] A Discussion on False Alarms

[0348] Not all false alarms encountered in the present system can be positively identified as false alarms, because the methods available to confirm fault situations are inadequate. The two labeling schemes used to confirm alarms as correlated with fault events are the syslog messages and the trouble tickets. Syslog messages are only sent in response to a particular fault situation, such as when a user or a process accesses a faulty server. When there are no users accessing the system, no relevant syslog messages are sent, and for this reason the fault situation may not be observed in the syslog messages. So, although a fault situation may exist and the system algorithm may be detecting it, the veracity of the alarm cannot be determined because no corroborating syslog messages exist. Alarms of this kind are counted as false. The trouble tickets are emails that are sent by users on the network in response to some difficulty encountered on the network. These messages suffer from a lack of accuracy in the problem report and are reactive. The inaccuracy causes certain predictive alarms to be declared as false. Reactive implies that the reports were received in response to an already existing fault situation.

[0349] There are several known sources of false alarms that are system specific. Such false alarms can be avoided by fine tuning the algorithm to a specific network. One common source of false alarms is the system backup, which occurs at a set time for a given network. For example, in the campus network, at system backup time, a large change is generated abruptly in a correlated fashion at the subnet level. This results in a detection by the agent although no fault exists. This problem can be alleviated if the system backup time is known. Once a network fault occurs, the network requires time to return to normal functioning. This period is also detected as correlated change points, although they do not necessarily correspond to a fault. Alarms that are generated at these times can be avoided by allowing a renewal time immediately after a fault has been detected. Thus the addition of hysteresis will help reduce the false alarms. It was observed that at the if layer the false alarm rate of the agent is much higher than at the ip layer. This has been attributed to the burstiness in both the if level variables. Increasing the order of the AR model may help in reducing the false alarm rate, but there is a trade off in detection time that needs to be contended with. Preliminary results indicate a lower false alarm rate for the enterprise network than for the campus network.
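
The two remedies described above, masking a known backup period and allowing a renewal time after a detected fault, can be sketched as a post-filter on the alarm stream. This is a minimal illustration; the function name `suppress_alarms`, the `renewal` length and the representation of backup periods as (start, end) pairs in minutes are assumptions made for the example.

```python
def suppress_alarms(alarm_times, fault_times, renewal, backup_windows=()):
    """Return the alarms that survive backup masking and the renewal time.

    alarm_times: alarm times in minutes.
    fault_times: times at which a fault was detected.
    renewal: assumed renewal time (minutes) after each detected fault
             during which further alarms are ignored.
    backup_windows: assumed (start, end) pairs for known backup periods.
    """
    kept = []
    for a in alarm_times:
        # Alarm falls inside a known system backup period.
        in_backup = any(start <= a <= end for start, end in backup_windows)
        # Alarm falls in the recovery period that follows a detected fault.
        in_renewal = any(0 < a - f <= renewal for f in fault_times)
        if not (in_backup or in_renewal):
            kept.append(a)
    return kept
```

A longer renewal time removes more of the post-fault alarms, at the cost of being blind to a genuinely new fault that begins during that period.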

[0350] Summary

[0351] Hence, the present invention provides an online network fault detection algorithm. This was achieved by designing an intelligent agent. Network faults can be modeled as correlated transient changes in the traffic-related MIB variables. This model is independent of specific fault descriptions. The network model was elucidated from a few of the known file server faults observed on one network. The model was found to fit several other file server failures on the same network and also on a completely different network. The model was also found to be good in the case of protocol implementation errors. By characterizing network fault behavior as transient short lived signals, the requirement of accurate traffic models for normal network behavior was circumvented.

[0352] The fault model developed also provides a first step towards the characterization and classification of network faults based on their statistical properties. Since network faults are modeled as correlated transient abrupt changes, the type of abrupt change is used to distinguish between the different classes of network faults. For example, as shown in FIG. 102, the fault space 400 can be roughly divided into traffic-related faults 23 and faults related to protocol implementation errors 21. Within these larger groups, based on the type of abrupt change, the class of AR detectable faults 25 is provided. By this we mean that the abrupt changes can be described by the AR model. Furthermore, based on the order of AR required to detect the abrupt changes, the class of AR order 1 (AR(1)) 27 is provided. Using this classification scheme, it is possible to develop very specific tools to deal with a large class of faults. For example, some faults may only be captured using higher orders of AR while others may require a small order. In each of these cases the polling frequency or the rate of acquisition of data may differ, based on the constraint of having a sufficient number of samples to obtain accurate estimates of the AR parameters. Thus, optionally polling the MIBs will help reduce the total bandwidth required to do fault management.
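
To make the AR(1) class concrete, the following minimal sketch fits a first order auto-regressive model to a window of samples and returns the residual signal whose mean and variance serve as model parameters. The least-squares estimator, the mean removal and the function name `ar1_residuals` are assumptions chosen for illustration; the description does not specify how the AR coefficient is estimated.

```python
def ar1_residuals(x):
    """Fit z[t] = a * z[t-1] + e[t] by least squares on a mean-removed window.

    x: list of samples from one MIB variable (at least two values).
    Returns (a, residuals), where residuals has len(x) - 1 entries.
    """
    m = sum(x) / len(x)
    z = [v - m for v in x]  # remove the window mean
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    a = num / den if den else 0.0  # least-squares AR(1) coefficient
    residuals = [z[t] - a * z[t - 1] for t in range(1, len(z))]
    return a, residuals
```

A short window gives a noisy estimate of the coefficient, which is the constraint on sample count and polling rate noted above.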

[0353] In the case of traffic-related faults that can be detected at a router, just three variables were required (ipIR, ipIDe, ipOR). Obtaining a finer resolution, up to the subnet level, required two more variables per interface (ifIO, ifOO). This choice of variables greatly reduces the dimensionality of the problem without significant compromise in the resolution of network faults.

[0354] Based on the network fault model proposed, a fault detection scheme is designed. The detection algorithm was developed with the vision to implement it in a distributed framework. This allows the implementation to be scalable for large networks. The algorithm is implemented in an online fashion to enable real-time mechanisms such as balancing or flow control. Since the trend in abnormality of the network is captured by the agent, it allows for confirming the existence of faulty conditions before recovery is undertaken. Furthermore, the prediction time scale is in the order of minutes and is sufficient time to perform any further verification before deciding on the course of recovery to be implemented.

[0355] While the invention has been described in detail in connection with preferred embodiments known at the time, it should be readily understood that the invention is not limited to the disclosed embodiments. Rather, the invention can be modified to incorporate any number of variations, alterations, substitutions or equivalent arrangements not heretofore described, but which are commensurate with the spirit and scope of the invention. Accordingly, the invention is not limited by the foregoing description or drawings, but is only limited by the scope of the appended claims.

What is claimed as new and desired to be protected by Letters Patent of the United States is:

1. A method for predictive fault detection in network traffic, comprising the steps of: choosing a set of Management Information Base (MIB) variables related to said fault detection; sensing a change point observed in each said MIB variable in said network traffic; generating a variable level alarm corresponding to said change point; and combining said variable level alarm to produce a node level alarm.
2. The method of claim 1 wherein said MIB variables are interfaces (if) and Internal Protocols (ip).
3. The method of claim 2 wherein said interfaces (if) further comprise variables ifIO (In Octets) and ifOO.
4. The method of claim 2 wherein said Internal Protocol (ip) further comprise variables ipIR (In Receives), ipIDE (In Delivers) and ipOR (Out Requests).
5. The method of claim 1 wherein said generating step further comprise the step of linearly modeling said MIB variables using a first order auto-regressive (AR) process to generate said variable level alarm.
6. The method of claim 5 further comprising the step of performing a sequential hypothesis test utilizing a Generalized Likelihood Ratio (GLR) on said linear model to generate said variable alarm.
7. The method of claim 1 wherein said combining step further comprise the step of correlating spatial and temporal information from said MIB variables.
8. The method of claim 7 wherein said step of correlating is performed utilizing a linear operator.
9. The method of claim 1 wherein said fault detection is applied as the definition of Quality of Service (QoS).
10. The method of claim 1 wherein said MIB variables are maintained by a Simple Network Management Protocol (SNMP).
11. The method of claim 1 wherein said network is a local area network.
12. The method of claim 1 wherein said network is a local area network.
13. The method of claim 1 wherein said fault comprise predictable and non-predictable faults.
14. A method for predictive fault detection in a network, comprising the steps of: generating variable level alarms corresponding to abrupt changes observed in each selected MIB variable; and correlating spatial and temporal information from said MIB variables utilizing a linear operator to produce a node level alarm.
15. The method of claim 14 wherein said MIB variables are interfaces (if) and Internal Protocols (ip).
16. The method of claim 15 wherein said interfaces (if) further comprise variables ifIO (In Octets) and ifOO.
17. The method of claim 15 wherein said Internal Protocol (ip) further comprise variables ipIR (In Receives), ipIDE (In Delivers) and ipOR (Out Requests).
18. The method of claim 14 wherein said step of generating further comprise the step of linearly modeling said MIB variables using a first order auto-regressive (AR) process to generate said variable level alarm.
19. The method of claim 18 further comprising the step of performing a sequential hypothesis test utilizing a Generalized Likelihood Ratio (GLR) on said linear model to generate said variable alarm.
20. The method of claim 14 wherein said fault detection is applied in the definition of Quality of Service (QoS).
21. The method of claim 14 wherein said MIB variables are maintained by a Simple Network Management Protocol (SNMP).
22. The method of claim 14 wherein said network is a local area network.
23. The method of claim 14 wherein said network is a local area network.
24. The method of claim 14 wherein said fault comprise predictable and non-predictable faults.
25. A method for predictive fault detection in a network, comprising the steps of: sensing network traffic and generating variable level alarms corresponding to changes in said traffic; and correlating spatial and temporal information from MIB variables related to said fault detection utilizing a linear operator to produce a node level alarm.
26. The method of claim 25 wherein said MIB variables are interfaces (if) and Internal Protocols (ip).
27. The method of claim 26 wherein said interfaces (if) further comprise variables ifIO (In Octets) and ifOO.
28. The method of claim 26 wherein said Internal Protocol (ip) further comprise variables ipIR (In Receives), ipIDE (In Delivers) and ipOR (Out Requests).
29. The method of claim 25 wherein said step of generating further comprise the step of linearly modeling said MIB variables using a first order auto-regressive (AR) process to generate said variable level alarm.
30. The method of claim 29 further comprising the step of performing a sequential hypothesis test utilizing a Generalized Likelihood Ratio (GLR) on said linear model to generate said variable alarm.
31. The method of claim 25 wherein said fault detection is applied in the definition of Quality of Service (QoS).
32. The method of claim 25 wherein said MIB variables are maintained by a Simple Network Management Protocol (SNMP).
33. The method of claim 25 wherein said network is a local area network.
34. The method of claim 25 wherein said network is a local area network.
35. The method of claim 25 wherein said fault comprise predictable and non-predictable faults.
36. A system for detecting fault in a network traffic, comprising: a data processing unit for choosing a set of Management Information Base (MIB) variables related to said fault detection; a sensor for sensing a change point observed in each said MIB variable in said network traffic and generating a variable level alarm corresponding to said change point; and a fusion center for combining said variable level alarm to produce a node level alarm.
37. The system of claim 36 wherein said MIB variables are interfaces (if) and Internal Protocols (ip).
38. The system of claim 37 wherein said interfaces (if) further comprise variables ifIO (In Octets) and ifOO.
39. The system of claim 37 wherein said Internal Protocol (ip) further comprise variables ipIR (In Receives), ipIDE (In Delivers) and ipOR (Out Requests).
40. The system of claim 36 wherein said sensor linearly models said MIB variables using a first order auto-regressive (AR) process to generate said variable level alarm.
41. The system of claim 40 wherein said sensor performs a sequential hypothesis test utilizing a Generalized Likelihood Ratio (GLR) on said linear model to generate said variable alarm.
42. The system of claim 36 wherein said fusion center correlates spatial and temporal information from said MIB variables.
43. The system of claim 42 wherein said correlating is performed utilizing a linear operator.
44. The system of claim 36 wherein said fault detection is applied in the definition of Quality of Service (QoS).
45. The system of claim 36 wherein said MIB variables are maintained by a Simple Network Management Protocol (SNMP).
46. The system of claim 36 wherein said network is a local area network.
47. The system of claim 36 wherein said network is a local area network.
48. The system of claim 36 wherein said fault comprise predictable and non-predictable faults.
49. A system for predictive fault detection in a network comprising: at least one sensor for generating variable level alarms corresponding to a change observed in a selected MIB variable; and a fusion center for correlating spatial and temporal information from said MIB variables utilizing a linear operator to produce a node level alarm.
50. The system of claim 49 wherein said MIB variables are interfaces (if) and Internal Protocols (ip).
51. The system of claim 50 wherein said interfaces (if) further comprise variables ifIO (In Octets) and ifOO.
52. The system of claim 50 wherein said Internal Protocol (ip) further comprise variables ipIR (In Receives), ipIDE (In Delivers) and ipOR (Out Requests).
53. The system of claim 49 wherein said sensor linearly models said MIB variables using a first order auto-regressive (AR) process to generate said variable level alarm.
54. The system of claim 53 wherein said sensor performs a sequential hypothesis test utilizing a Generalized Likelihood Ratio (GLR) on said linear model to generate said variable alarm.
55. The system of claim 49 wherein said fault detection is applied in the definition of Quality of Service (QoS).
56. The system of claim 49 wherein said MIB variables are maintained by a Simple Network Management Protocol (SNMP).
57. The system of claim 49 wherein said network is a local area network.
58. The system of claim 49 wherein said network is a local area network.
59. The system of claim 49 wherein said fault comprise predictable and non-predictable faults.
60. A system for monitoring network traffic for predictive fault detection, comprising: at least one sensor for generating a variable level alarm corresponding to a change in said traffic; and a fusion center for correlating spatial and temporal information from MIB variables related to said fault detection utilizing a linear operator to produce a node level alarm.
61. The system of claim 60 wherein said MIB variables are interfaces (if) and Internal Protocols (ip).
62. The system of claim 61 wherein said interfaces (if) further comprise variables ifIO (In Octets) and ifOO.
63. The system of claim 61 wherein said Internal Protocol (ip) further comprise variables ipIR (In Receives), ipIDE (In Delivers) and ipOR (Out Requests).
64. The system of claim 60 wherein said sensor linearly models said MIB variables using a first order auto-regressive (AR) process to generate said variable level alarm.
65. The system of claim 64 wherein said sensor performs a sequential hypothesis test utilizing a Generalized Likelihood Ratio (GLR) on said linear model to generate said variable alarm.
66. The system of claim 60 wherein said fault detection is applied in the definition of Quality of Service (QoS).
67. The system of claim 60 wherein said MIB variables are maintained by a Simple Network Management Protocol (SNMP).
68. The system of claim 60 wherein said network is a local area network.
69. The system of claim 60 wherein said network is a local area network.
70. The system of claim 60 wherein said fault comprise predictable and non-predictable faults.